
ProtoNet 6.0: organizing 10 million protein sequences in a compact hierarchical family tree.

Nadav Rappoport, Solange Karsenty, Amos Stern, Nathan Linial, Michal Linial.

Abstract

ProtoNet 6.0 (http://www.protonet.cs.huji.ac.il) is a data structure of protein families that cover the protein sequence space. These families are generated through an unsupervised bottom-up clustering algorithm. This algorithm organizes large sets of proteins in a hierarchical tree that yields high-quality protein families. The 2012 ProtoNet (Version 6.0) tree includes over 9 million proteins, of which 5.5% come from UniProtKB/SwissProt and the rest from UniProtKB/TrEMBL. The hierarchical tree structure is based on an all-against-all comparison of 2.5 million representatives of UniRef50. Rigorous annotation-based quality tests prune the tree to the 162,088 most informative clusters. Every high-quality cluster is assigned a ProtoName that reflects the most significant annotations of its proteins. These annotations are dominated by GO terms, UniProt/Swiss-Prot keywords and InterPro. ProtoNet 6.0 operates in a default mode. When used in the advanced mode, this data structure offers the user a view of the family tree at any desired level of resolution. Systematic comparisons with previous versions of ProtoNet are carried out. They show how our view of protein families evolves as larger parts of the sequence space become known. ProtoNet 6.0 provides numerous tools to navigate the hierarchy of clusters.

Year:  2011        PMID: 22121228      PMCID: PMC3245180          DOI: 10.1093/nar/gkr1027

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

ProtoNet (1) was launched in 2002. The goal of this system was to achieve an automatic hierarchical clustering of the protein sequence space. It covered 94 000 protein sequences from Swiss-Prot. Now, almost 10 years later, our census of proteins has grown tremendously. The UniProtKB database of protein sequences (2) includes over 17 million proteins (UniProt, August 2011), of which 0.53 million form the UniProtKB/Swiss-Prot section. While the size of UniProtKB/Swiss-Prot grew from 2002 by a factor of 5 (SwissProt release 40.0, October 2001), the TrEMBL section (TrEMBL Release 18.0) went from 550 000 to 16.5 million sequences, a 30-fold increase during the same period. Notably, even in the curated, high-quality UniProtKB/Swiss-Prot section, only 25% of the proteins carry evidence at the protein or transcript levels, while 70% of the sequences are inferred from homology and ∼3% remain questionable and are marked as predicted or even uncertain proteins. The situation with the millions of sequences from UniProtKB/TrEMBL is far less satisfying: only 3% carry some experimental supporting evidence and the majority of sequences (74%) are based only on prediction. With this immense growth in the number of protein sequences, it is clear that only unsupervised methods can cope with this data set. We need algorithms that can automatically trace the functional and evolutionary relatedness among protein sequences (3). Assigning biological functions to proteins is a challenging task and a major obstacle (4,5). Despite important progress in structural genomics, enzyme classifications and phylogenomics, the goal of automatic functional inference is far from being reached (3,6–8). Numerous motif recognition algorithms, statistical model-based methods and clustering methods were developed during the last two decades for the purpose of handling the growing number of sequences.
These methods differ in their coverage, the level of manual curation involved and even in the basic definition of a domain family. For example, Pfam (9), SMART (10), EVEREST (11), PANTHER (12) and Gene3D (13) are based on thousands of profile Hidden Markov Models (profile HMMs). New sequences that pass a pre-determined threshold of similarity are assigned to the corresponding model domain family. Additional resources are based on algorithms that search for signatures, regular expressions or Position Specific Scoring Matrix (PSSM) fingerprints. Representative databases that follow this paradigm include PROSITE (14), PRINTS (15), ProDom (16) and BLOCKS (17). The above resources are based on sequence data. In addition, integrative resources such as PIRSF (18), CDD (19) and InterPro (20) take a different approach in order to attain higher coverage of the protein space. They accomplish this by merging a variety of external sources with a focus on protein families, domains and functional sites. The classifications of SCOP (21), CATH (22) and SUPERFAMILY (23) rest on 3D-structural information. A functional perspective is offered by the ontology-based resource of Gene Ontology (GO) (24). The available data is highly redundant, which creates a major difficulty in this area. Thus, the main archive of the UniProt database contains 25 million sequences (25), which represent about 17 million unique proteins. UniRef50, with only 4 million sequences, is created by grouping together proteins with >50% identical amino acids. However, in order to study sequence homologies and the evolution of protein families, they must be viewed at a much finer level of granularity. In order to deal with the enormous number of known protein sequences, ProtoNet 6.0 automatically generates, with no supervision, a consistent classification tree. This system covers over 9 million proteins from UniProtKB.
To address the expected future growth in the number of protein sequences, the system is equipped with a protocol for maintenance and updating. A system-provided confidence parameter quantifies the quality of every cluster in ProtoNet 6.0. Additional tools for analysis and visualization enhance the user's navigation options through the ProtoNet tree. These tools provide a rich biological context for the observed parts of the tree. We describe here the newly introduced capabilities and improvements compared with the previous version (26) where one million proteins were classified (1 072 911 sequences, UniProt Release, February 2005, ProtoNet Version 4.0).

PROTEIN SEQUENCES DATABASE

All database sources used in ProtoNet 6.0 have been thoroughly updated. The most critical aspect is the use of UniRef50 clusters as our basic objects. On average, a UniRef50 cluster contains four proteins. Thus, the 2 478 328 UniRef50 proteins that are included in ProtoNet 6.0 represent over 9 million sequences. In comparison, the number of protein sequences in ProtoNet 4.0 is 1 072 911.

PROTONET TREE CONSTRUCTION

The basic algorithm of ProtoNet was previously described (1,27). It starts by pre-calculating an all-against-all BLAST similarity score (28) for all protein representatives from the UniRef50 resource (called cluster seed proteins). The E-scores of these similarities are used to drive a continuous hierarchical bottom-up clustering process. At each step, the two most similar protein clusters are joined [the exact algorithm is described in (29)]. Importantly, BLAST E-scores with an extremely relaxed threshold are considered throughout the ProtoNet construction (E-score = 100). The bottom-up agglomerative clustering of the ProtoNet algorithm benefits from such relaxed E-score distances in constructing a robust family tree. A key ingredient of ProtoNet 6.0 that is essential for handling such a large number of proteins is the Constrained Memory-ProtoNet algorithm (29). The result is a hierarchy of protein clusters at various degrees of biological granularity. This hierarchy is structured as a collection of trees that forms what we call the ProtoNet Tree (actually it is a ProtoNet forest). The root clusters contain all the proteins of the tree, while other clusters represent subdivisions of proteins into smaller groups. The hierarchical definitions allow the user to navigate from a protein to the sub-family and the super-family levels in order to discover specific functions and evolutionary signals.
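The bottom-up merging described above can be illustrated with a minimal sketch. It is not the memory-constrained algorithm of (29): it is a naive average-linkage loop, and it assumes that E-scores are compared on a log10 scale and that any pair without a BLAST hit is assigned the relaxed threshold E = 100. The function name `agglomerate`, the input layout and the cluster identifiers (`C0`, `C1`, ...) are hypothetical.

```python
import math
from itertools import combinations

E_MAX = 100.0  # relaxed E-score threshold from the text; pairs without a hit get this value


def agglomerate(proteins, escores):
    """Naive bottom-up clustering: repeatedly merge the two closest clusters.

    proteins: list of protein identifiers (toy input).
    escores:  dict mapping frozenset({a, b}) -> BLAST E-score.
    Returns the merges as (left, right, merged_id, distance), in merge order.
    """
    def dist(a, b):
        # Assumption: work on log10(E) so that very small E-scores stay comparable.
        return math.log10(escores.get(frozenset((a, b)), E_MAX))

    clusters = {p: (p,) for p in proteins}  # cluster id -> tuple of member proteins
    merges = []
    next_id = 0
    while len(clusters) > 1:
        best = None
        for x, y in combinations(sorted(clusters), 2):
            pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
            # Average linkage (arithmetic mean) over all cross-cluster pairs.
            d = sum(dist(a, b) for a, b in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        new_id = f"C{next_id}"
        next_id += 1
        clusters[new_id] = clusters.pop(x) + clusters.pop(y)
        merges.append((x, y, new_id, d))
    return merges
```

The loop is cubic in the number of seeds, so it only works for toy inputs. Handling millions of seeds is what the memory-constrained variant is for.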

THE HIERARCHY'S QUALITY

The entire protocol to construct ProtoNet is unsupervised and therefore no annotations are included. However, measuring the correspondence between a given cluster and specific annotations that are provided by external expert systems is essential for the supervised validation of the automatically generated ProtoNet clusters. We thus define the notion of a correspondence score (CS). The CS for a specific cluster and a given keyword is a measure of the correlation between the two. Formally, let us fix a cluster C in the ProtoNet tree and a keyword K (from a specific source such as InterPro). Let c be the set of proteins in cluster C and let k be the set of proteins in the system annotated by keyword K. We define the CS as:

$$\mathrm{CS}(C, K) = \frac{|c \cap k|}{|c \cup k|}$$

The cluster receiving the maximal score for keyword K (called K's best cluster) is considered the cluster that best represents K within the ProtoNet tree. The score for a given cluster on keyword K ranges from 0 (no correspondence) to 1 (the cluster C consists of exactly the proteins with keyword K). The CS values are used as a quality measure for the ProtoNet tree. For example, we may consider the distribution of CS values over all ProtoNet clusters or over clusters whose size exceeds some cutoff threshold. In order to obtain a biologically relevant view of the hierarchy, we applied several tests that allow us to focus only on the clusters that are enriched with some coherent biological information. The main algorithmic difference between ProtoNet 6.0 and the earlier version ProtoNet 4.0 (26) is the use of CM-ProtoNet (29). We refined the clusters’ quality test by comparing the CM-ProtoNet method with single-linkage clustering [as implemented in ClusTr (30)]. The tests were carried out on 3.2 million proteins from ProtoNet 5.1 (Table 1). In addition, we tested the impact of selecting UniRef50 as cluster seed proteins for ProtoNet. It can be seen that CM-ProtoNet outperforms the other methods that were applied to the same set of proteins.
Notice that the main improvement of MC-ProtoNet comes from enhanced sensitivity. The performance of the Single linkage algorithm drops drastically due to a low sensitivity. We tested three choices of cluster seeds: UniRef50 representatives (the choice that we finally adopted), UniRef90 (proteins sharing >90% sequence identity) and the complete redundant protein sequences. It is remarkable that the quality of clustering with respect to all three choices remains essentially unchanged.
Table 1.

Clustering performance evaluation based on Pfam keywords

Database    Clustering       CS     Specificity  Sensitivity
UniRef90    MC-ProtoNet      0.89   0.96         0.92
UniRef90    Single Linkage   0.78   0.93         0.24
UniRef90    ProtoNet 4.0     0.75   0.94         0.79
UniRef50    MC-ProtoNet      0.88   0.96         0.91
UniRef50    Single Linkage   0.72   0.91         0.79
SwissProt   MC-ProtoNet      0.90   0.96         0.94
SwissProt   Single Linkage   0.81   0.90         0.91

Tests were performed on UniRef90 (1.8 M), UniRef50 (960 K) and SwissProt (220 K).

The same tests with respect to a set of keywords from Pfam Clan (9) validated the high performance of the MC-ProtoNet algorithm over other clustering methods (not shown). We confirmed that the protocol that was applied to construct ProtoNet 6.0 produces a stable tree with a collection of biologically coherent families and super-families.
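The correspondence score used in this section can be sketched in a few lines. The sketch assumes that CS is the ratio of the intersection to the union of the cluster's protein set c and the keyword's protein set k, which matches the stated range from 0 (no correspondence) to 1 (identical sets). The specificity and sensitivity formulas are also assumptions (TP/(TP+FP) and TP/(TP+FN)), since the section does not define them. All function names are hypothetical.

```python
def correspondence_score(cluster, keyword_proteins):
    """CS = |c ∩ k| / |c ∪ k| (assumed intersection-over-union form)."""
    c, k = set(cluster), set(keyword_proteins)
    union = c | k
    return len(c & k) / len(union) if union else 0.0


def specificity_sensitivity(cluster, keyword_proteins):
    """Assumed definitions: specificity = TP/(TP+FP), sensitivity = TP/(TP+FN)."""
    c, k = set(cluster), set(keyword_proteins)
    tp = len(c & k)
    specificity = tp / len(c) if c else 0.0
    sensitivity = tp / len(k) if k else 0.0
    return specificity, sensitivity


def best_cluster(clusters, keyword_proteins):
    """Return the id of the cluster with the maximal CS for the keyword.

    clusters: dict mapping cluster id -> set of proteins.
    """
    return max(clusters, key=lambda cid: correspondence_score(clusters[cid], keyword_proteins))


# Toy data: cluster B covers the keyword better than its sub-cluster A.
toy_clusters = {"A": {"p1", "p2", "p3"}, "B": {"p1", "p2", "p3", "p4", "p5"}}
toy_keyword = {"p1", "p2", "p3", "p4"}
```

On the toy data, cluster A is fully specific but misses one annotated protein, while the larger cluster B captures all annotated proteins at the cost of one unannotated member.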

SELECTING STABLE CLUSTERS

The ProtoNet tree is huge, and the immense number of its protein clusters makes it quite impractical to navigate the tree. In order to deal with this difficulty, we pruned the tree. The basic idea is that many clusters that are created along the process of generating the tree are biologically irrelevant and uninteresting. For example, a root cluster in the ProtoNet forest typically contains thousands of unrelated proteins. A process of repeated pair-wise merging yields a tree of size roughly twice the number of leaves (see illustration in Figure 1A). Therefore, starting with the 2.5 million UniRef50 seed proteins we obtain 5.0 million clusters. We applied several computational procedures that are aimed at reducing this number. Our aim is to simplify the navigation in the system while maintaining the hierarchical structure and with essentially no loss in clusters’ quality.
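The "roughly twice the number of leaves" estimate follows from counting merges: each pairwise merge creates one new cluster and reduces the number of top-level clusters by one. A small sketch of that arithmetic, with a hypothetical helper name and an optional root count to account for the tree actually being a forest:

```python
def merge_tree_size(n_leaves, n_roots=1):
    """Total clusters (leaves + internal nodes) produced by pairwise merging.

    Starting from n_leaves singletons and ending with n_roots trees takes
    n_leaves - n_roots merges, each of which adds exactly one cluster.
    """
    return n_leaves + (n_leaves - n_roots)


# Seed count quoted for ProtoNet 6.0; a single root is assumed for simplicity.
total_clusters = merge_tree_size(2_478_328)
```

With the 2 478 328 UniRef50 seeds this gives 4 956 655 clusters, i.e. about 5.0 million. A forest with more roots would have slightly fewer.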
Figure 1.

ProtoNet clusters following pruning at selected thresholds. (A) A scheme of the binary tree following low and high condensations (LT ≥ x and LT ≥ y). The high level of compression (LT = 5) results in a smaller number of stable clusters. (B) Each panel represents a cluster summary according to a selected threshold (LT). Low (LT = 0.2) and high condensation level (LT = 5) differ in their cluster size and other statistical properties. Details on the cluster size, depth (by PL), the number of hypothetical proteins, solved structures in the PDB database and more are shown.

To this end, we sought intrinsic parameters of ProtoNet that measure the stability of a cluster. One such parameter is Life Time (LT), which is the difference between the time (i.e. merging steps) in which a cluster is created and the time it is merged into a larger cluster. This number reflects the relative height of a cluster in the merging tree. The level of the tree (called ProtoLevel, PL) is used as an internal monotonic timer for merging along the clustering process (which is reflected by the index of the cluster, Figure 1A). Individual protein sequences have PL = 0 and for the root of the ProtoNet tree PL = 100. The idea is that stable clusters tend to be more relevant biologically. We thus used a tradeoff between the number of clusters that are retained and the reduction in the performance of the clusters, measured by the average of the CS for all clusters. A minimal reduction in the average CS score for the InterPro keyword annotations was attained for LT < 1.0. We thus set LT = 1.0 as the default parameter (see ‘Advanced Navigation’). Figure 1 illustrates the pruning process at different LT cutoffs (marked x, y). Evidently, fewer valid clusters (colored red) remain as LT is increased. Figure 1B shows a cluster summary at different LT cut-offs. Note that the statistical parameters of the analyzed clusters depend on the choice of LT values.
The pruned version of the ProtoNet 6.0 tree at LT = 1.0 and PL = 90 has 162 088 high-quality stable clusters. With these parameters, the original number of 5 million clusters (including leaves) is reduced by a factor of about 30.
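The LT-based pruning described in this section can be sketched as follows. The sketch assumes that LT is the PL at which a cluster's parent is created minus the PL at which the cluster itself is created, that roots and individual proteins are always retained, and that the children of a pruned cluster are re-attached to the nearest retained ancestor. The PL cutoff is not modelled. The data layout and the function name are hypothetical.

```python
def prune_by_lifetime(parent, pl, lt_min=1.0):
    """Keep clusters whose Life Time is at least lt_min.

    parent: dict node -> parent node (None for a root).
    pl:     dict node -> ProtoLevel at which the node was created (0 for proteins).
    Returns a new parent map over the retained nodes only.
    """
    leaves = set(parent) - {p for p in parent.values() if p is not None}

    def lifetime(node):
        p = parent[node]
        # Assumption: a root is never merged, so its lifetime is unbounded.
        return float("inf") if p is None else pl[p] - pl[node]

    kept = {n for n in parent if n in leaves or lifetime(n) >= lt_min}

    def nearest_kept_ancestor(node):
        p = parent[node]
        while p is not None and p not in kept:
            p = parent[p]
        return p

    return {n: nearest_kept_ancestor(n) for n in kept}
```

Because short-lived clusters disappear and their children move up, a retained cluster can end up with more than two children, which is why the condensed tree is no longer binary.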

ANNOTATION INFERENCE

Functional inference in ProtoNet 6.0 is done by an automatic high-confidence method that infers the functional annotation of a cluster by integrating the annotations of its individual proteins. The method builds on functional annotations from multiple resources including InterPro, GO (24), UniProt keywords (2), ENZYME (31) and more. We consider all the annotations that cover >1% of the proteins and focus on those that best fit the proteins of the cluster. Evidently, automatic inference cannot be error-free. Thus, a predetermined specificity threshold is calculated for the keywords associated with the cluster's proteins. Such annotation is assigned as the ProtoName (Figure 2). To avoid faulty inference, we calculated ProtoName for clusters in which >20% of the proteins share the specific annotation where this annotation shows an enrichment of P-value < 0.005. Recall that presenting additional names for a cluster often hints at a novel overlooked function or the presence of multi-domain proteins that exhibit multi-functionality.
Figure 2.

The contribution of annotation types to ProtoNet clusters. (A) About 40 annotation types that cover different aspects of function are included. Some of the minor annotation sources were combined and depicted as ‘others’. (B) The major annotation types and their coverage as measured by the fraction of proteins that are assigned with the indicated annotation type are listed. In ProtoNet 6.0, a total of 143 849 828 annotations (74 416 565 without taxonomy) is associated with the ∼9 million protein sequences.

Each of the ∼162 000 stable clusters was assigned a ProtoName. On average, a cluster is associated with 9.7 possible names. Most names are derived from Taxonomy (33%), UniProt (19%), GO (18%) and InterPro (17%); the rest include information from structural classifications [e.g. SCOP (21) and CATH (22)] or ENZYME-based annotations (31). A partition of the unique clusters according to their annotation types is shown (Figure 2). Notably, most annotation types contribute to some ProtoName. This suggests that the integration of knowledge from diverse annotation sources substantially improves the performance of the ProtoNet tree.
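The ProtoName criteria stated in this section (an annotation shared by >20% of the cluster's proteins, with an enrichment P-value < 0.005) can be illustrated with a short sketch. The section does not name the statistical test, so a one-sided hypergeometric test is assumed here. The function names and the data layout are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, annotated, drawn):
    """P(X >= k) when `drawn` proteins are sampled from `population`,
    of which `annotated` carry the keyword (assumed enrichment test)."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    return sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    ) / total


def candidate_protonames(cluster, annotations, population_size,
                         min_fraction=0.20, max_pvalue=0.005):
    """Return keywords passing both thresholds, most significant first.

    annotations: dict keyword -> set of all proteins carrying that keyword.
    """
    cluster = set(cluster)
    passing = []
    for keyword, annotated in annotations.items():
        hits = len(cluster & annotated)
        if hits / len(cluster) <= min_fraction:
            continue  # shared by too few proteins of the cluster
        p = hypergeom_sf(hits, population_size, len(annotated), len(cluster))
        if p < max_pvalue:
            passing.append((p, keyword))
    return [keyword for p, keyword in sorted(passing)]
```

A keyword that is common across the whole database can exceed the 20% coverage threshold and still fail the enrichment test, which is the reason for combining the two criteria.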

GENOMIC VIEW ON PROTEIN CLUSTERS

A huge number of organisms are represented in UniProtKB (Figure 2B). Still, a third of the protein sequences originate from a relatively small number of organisms that were completely sequenced. A substantial number of these organisms (mostly multi-cellular) also serve as genetic model organisms. Therefore, we included a selected list of over 30 organisms on which the user can choose to focus. These organisms represent all superkingdoms.

WEBSITE PROPERTIES

Several added features in the ProtoNet 6.0 website make it easier to reach an in-depth analysis of the ProtoNet tree. We describe these new features in ‘simplified mode’ and in ‘advanced mode’ (Figure 3).
Figure 3.

ProtoNet cluster page and a tree viewer in simplified and advanced modes. (Top) From the cluster page (Cluster ID 4201544) the user can focus on the ProtoName and the collection of additional high-quality annotations that are associated with this cluster. The number of proteins from the selected organisms is indicated with a framed T-symbol (for Taxonomy). Similarly, clusters that include proteins with solved 3D structures are marked by a symbol for PDB. Each cluster provides a short summary as a popup box with the number of proteins and the appearance of pre-selected organisms. The red edges in the tree indicate the branches that include the selected organisms. All other branches are faded. (Bottom) Using the advanced mode, the number of clusters in the ProtoNet tree is listed according to the predetermined LT and PL values. There are several sorting options according to the cluster size and the properties of the tree. An interactive use of the condensation levels allows inspecting the near vicinity of a selected cluster in the ProtoNet hierarchy.

Browsing cluster names

Cluster names are now available for browsing. One can choose a keyword of interest and view clusters that are named by it. Note that a keyword of low statistical significance will be absent in ProtoName. Figure 2 shows the contribution of the major annotation resources that are included in determining the ProtoName.

Hypothetical and putative proteins

The assignment of a biological function to clusters suggests a safe scheme of assigning function to proteins with unknown function. Naively, the protein can be assigned the function of all clusters to which it belongs. This can be applied for ‘hypothetical’ and ‘putative’ proteins within the clusters. It can also be used for a new user-provided protein sequence (with the ‘Classify your protein’ option). We provide a list of all the proteins that are marked as hypothetical and putative proteins in the summary table (Figure 3).
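The naive scheme described above, in which a protein inherits the function of every cluster it belongs to, amounts to walking from the protein up to its root and collecting the ProtoNames along the way. A minimal sketch, assuming the tree is stored as a child-to-parent map (the data layout, the function name and the toy identifiers are hypothetical):

```python
def inherited_names(protein, parent, protoname):
    """Collect (cluster id, ProtoName) pairs on the path from a protein to its root.

    parent:    dict node -> parent cluster (None or missing for a root).
    protoname: dict cluster id -> assigned ProtoName (clusters without one are skipped).
    """
    names = []
    node = parent.get(protein)
    while node is not None:
        name = protoname.get(node)
        if name is not None:
            names.append((node, name))
        node = parent.get(node)
    return names


# Toy tree: a hypothetical protein sits in cluster C1, which is nested in C2.
toy_parent = {"hypothetical_1": "C1", "C1": "C2", "C2": None}
toy_names = {"C1": "specific family", "C2": "broad superfamily"}
```

The result is ordered from the most specific cluster to the most general one, so the first entries are usually the most informative candidates.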

ProtoNet tree resolution

Following the pruning process described above, ProtoNet is no longer a binary tree. To cope with this non-binary condensed version, we introduced the ProtoBrowser page that zooms in on the tree only in the vicinity of the cluster that is being analyzed. A selected branch is shown in the context of related neighboring branches. The user hovers the mouse over a cluster to display essential information such as the cluster size and the number of proteins from selected species (if a ‘genomic view’ was activated). An example of such ProtoBrowser tree views is shown (Figure 3).

Integration of annotation sources

The functional analysis of a cluster is performed using PANDORA (32) visualization, which allows in-depth analysis of large protein sets. The system allows direct export from the cluster page to PANDORA. Using PANDORA, it is possible to assess the functional relevance of the proteins in the clusters from numerous biological aspects. The annotation sources used by PANDORA were updated, and now offer ∼200 000 different annotations, spanning several different biological domains. Specifically, PANDORA extracts most of the annotations from UniProtKB. For structural annotations, CATH (22), SCOP and Gene3D (13) are considered. The functional domain is covered by the four layers of the ENZYME classification (31) and the GO structure with the three main functional branches: cellular component, biochemical function and biological process (24). The protein families are forwarded to the PANDORA analysis tool, which statistically analyzes a given cluster by means of the annotations that are assigned to its proteins (32). On average, each protein sequence in ProtoNet is associated with 6.6 different annotation types (11 and 10 annotations for human and mouse, respectively). PANDORA also supports each of the dozen domain and family resources of the InterPro collection. In a typical application of PANDORA, the user concentrates on any of the 200 000 annotations with the query ‘Get clusters containing proteins with a given keyword’ (e.g. InterPro domain: GTPase-binding/formin homology 3). In response, one receives an integrated view of all proteins that are associated with this annotation, not only those that belong to the UniRef50 seed proteins (see below).

Expanded proteins

The ProtoNet tree is built from the representative proteins of UniRef50. The cluster view offers a list of the proteins of the cluster. Two levels of expansion are provided: the list of proteins according to the UniRef representatives and the complete UniProtKB list. On average, the passage from UniRef50 to UniRef90 and from UniRef90 to the UniProtKB full list results in a 2.5-fold and 1.8-fold expansion, respectively. Cluster A4686503 contains 487 proteins that have a maximal CS for the keyword Cadherins of CATH homologous superfamily (CS = 0.767). This cluster is expanded to 2349 proteins. Similarly, the expanded list of proteins can be conveniently viewed via PANDORA (see ‘Integration of Annotation Sources’). For example, 557 proteins in the ProtoNet 6.0 database are annotated Cadherins according to the CATH homologous superfamily, but using PANDORA, this list is expanded to a total of 2298 proteins.
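As a quick check of the figures quoted above, the two average expansion steps can be combined and compared with the Cadherin examples. The sketch assumes that the smaller count in each example refers to UniRef50 seed proteins and the larger one to the fully expanded UniProtKB list. The helper name is hypothetical.

```python
def fold_expansion(seed_count, expanded_count):
    """Ratio between the expanded protein list and the seed list."""
    return expanded_count / seed_count


# Average expansion: UniRef50 -> UniRef90 (2.5-fold), then UniRef90 -> UniProtKB (1.8-fold).
average_fold = 2.5 * 1.8

# Figures quoted in the text for the Cadherin cluster and the Cadherin keyword.
cadherin_cluster_fold = fold_expansion(487, 2349)
cadherin_keyword_fold = fold_expansion(557, 2298)
```

The combined average is 4.5-fold. The Cadherin cluster (about 4.8-fold) and the keyword-based list (about 4.1-fold) both fall close to that value.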

Phylogenetic tree viewer

The user can select one or several organisms and have the branches in the ProtoNet tree that include the selected organisms highlighted. Navigation through the selection of complete proteomes is illustrated in Figure 3. It is shown for a few selected mammals (human, mouse, rat). Only branches that include proteins from the selected organisms are highlighted, though all ‘faded’ clusters can still be analyzed. In Figure 3, the indicated cluster (Cluster ID 4201544) contains 310 proteins. The number of proteins that come from the selected organisms is listed (Figure 3). At any stage the user can reset, remove or change the taxonomy-based selection.
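The highlighting logic described above can be sketched as a single pass over the proteins of the selected organisms, counting each one in every cluster on its path to the root. Clusters that receive no count correspond to the ‘faded’ branches. The data layout and the function name are hypothetical.

```python
from collections import Counter


def highlighted_clusters(parent, organism_of, selected):
    """Count, per cluster, the proteins that come from the selected organisms.

    parent:      dict node -> parent cluster (None for a root).
    organism_of: dict protein -> organism name.
    selected:    set of organism names chosen by the user.
    Clusters absent from the result contain no selected protein.
    """
    counts = Counter()
    for protein, organism in organism_of.items():
        if organism not in selected:
            continue
        node = parent.get(protein)
        while node is not None:
            counts[node] += 1
            node = parent.get(node)
    return dict(counts)


# Toy tree with two sub-clusters under one root.
toy_parent = {"h1": "A", "m1": "A", "y1": "B", "A": "R", "B": "R", "R": None}
toy_organisms = {"h1": "human", "m1": "mouse", "y1": "yeast"}
```

For the toy tree and the selection {human, mouse}, clusters A and R are highlighted with two proteins each, and cluster B is faded.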

Comparing versions

The user may select to navigate each of the main releases of ProtoNet. Maintenance of the different versions allows assessing the changes in the clusters along the constant growth in protein sequences. For example, with the same threshold of PL = 90 and LT = 1 there are 5245 and 74 446 stable clusters by ProtoNet versions of SwissProt 40.28 and UniProt 8.1, respectively.

Advanced navigation

The advanced mode gives the user additional control over the parameters of the visualization that concern: (i) the ProtoBrowser and (ii) ProtoNet tree condensation. The user can choose to activate the ProtoBrowser at a different resolution. While the simplified mode (Figure 3, upper panel) shows two levels above and below the observed cluster (marked in red font in the tree, Figure 3), in ‘advanced mode’ the number of presented surrounding tree layers is a user-selected parameter. By moving up the tree, one observes how the cluster grows in size and becomes more diverse. The user can change the tree resolution by modifying the parameters of the tree condensation protocol (see ‘Selecting Stable Clusters’). Such a change of parameters turns a binary tree into a non-binary tree, and some browsing options help the user in following this modification. Other capacities of the ‘advanced mode’ reflect certain intrinsic properties of the ProtoNet Tree. The user can retrieve the ProtoNet clusters at a specific PL (Figure 3, lower panel). This determines the number of clusters to be presented, but it also (indirectly) allows the user to focus on a PL that is maximally enriched by proteins with unknown function. While a careful biological interpretation of the ProtoNet 6.0 clusters is beyond the scope of this paper, we should note a significant explosion of proteins of unknown function that appears at PL > 90. Additional queries address the connectivity of selected proteins in the tree. In particular, one can get the lowest common cluster of any two proteins. One can also search for the appearance of a specific protein within a cluster, search for all the clusters that are associated with a selected keyword, and more.
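The ‘lowest common cluster’ query mentioned above is the lowest common ancestor of two leaves in the tree. A minimal sketch, assuming the hierarchy is available as a child-to-parent map (the data layout and the function name are hypothetical, not the ProtoNet implementation):

```python
def lowest_common_cluster(a, b, parent):
    """Return the smallest cluster containing both proteins, or None.

    parent: dict node -> parent cluster (None or missing for a root).
    None is returned when the proteins lie in different trees of the forest.
    """
    ancestors = set()
    node = parent.get(a)
    while node is not None:
        ancestors.add(node)
        node = parent.get(node)

    node = parent.get(b)
    while node is not None:
        if node in ancestors:
            return node
        node = parent.get(node)
    return None


# Toy forest: p1, p2, p3 share a tree, q1 sits in a separate one.
toy_parent = {"p1": "X", "p2": "X", "p3": "Y", "X": "Y", "Y": None, "q1": "Z", "Z": None}
```

Since ProtoNet is a forest rather than a single tree, the sketch returns `None` for two proteins that never merge.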

A TEST CASE—METAGENOME TO FUNCTION

The Global Ocean Sampling (GOS) dataset is a huge collection of (mostly) unidentified marine metagenome sequences that covers nearly all known prokaryotic protein families (33). We now illustrate a test case of one hypothetical protein, GOS_6351915. Applying the ProtoNet option ‘Paste your new sequence’ in basic mode with default parameters finds this sequence in cluster 4033656 (26 proteins, 5 named ‘predicted protein’ and an additional 2 proteins named ‘putative’), all of which belong to the InterPro entry ‘Longin’ and to additional keywords that specify the relevance to SNARE-like (based on SCOP). However, upon moving up the tree to a larger cluster with 107 proteins (Cluster ID 4312270), the dominating keyword (ProtoName) changes to InterPro IPR016444: Synaptobrevin, metazoa/fungi. The taxonomy of the merged cluster includes only metazoa and fungi (excluding green plants). Activating the ‘advanced mode’ for a condensed tree (LT threshold = 10) indicates that the GOS_6351915 sequence belongs to a larger cluster (213 proteins, Root) where the most significant annotation (Cluster ID 4446624 and CS = 0.965) is from the CATH topology of Beta-Lactamase and the homologous group CATH 3.30.450.50. Analyzing this very stable cluster via PANDORA shows that the dominating features are membrane and coiled coil. The significant P-values for other functional annotations such as v-SNARE, trafficking, synaptic vesicles, ER and Golgi confirm that the GOS_6351915 sequence is a genuine member of the SNARE family. We postulate with high confidence that this sequence is a Synaptobrevin-like protein that is probably derived from a unicellular marine centric diatom species.

MAINTENANCE AND UPDATING

ProtoNet will be updated once a year. A partition of UniProt into the sections UniProt/Swiss-Prot and UniProt/TrEMBL will be implemented. This will allow users to focus, as needed, on each UniProt section separately. Future ProtoNet releases will incorporate additional annotation resources from KEGG, STRING, OMIM and GO evidence codes. To provide the user with control over the confidence level, annotation evidence (e.g. experimental, inferred) will be added for each protein in our database. ProtoNet 6.0 has also incorporated a few fundamental technical improvements in automation, database design and technologies. These improvements concern the automation of future updates and releases.

FUNDING

Sudarsky Center for Computational Biology (SCCB) fellowship (to N.R.); EU FRVII Prospects consortium and the Israel Science Foundation ISF 592/07. Funding for open access charge: EU FRVII Prospects consortium and the Israel Science Foundation ISF 592/07. Conflict of interest statement. None declared.