
Pythoscape: a framework for generation of large protein similarity networks.

Abstract

Pythoscape is a framework implemented in Python for processing large protein similarity networks for visualization in other software packages. Protein similarity networks are graphical representations of sequence, structural and other similarities among proteins for which pairwise all-by-all similarity connections have been calculated. Mapping of biological and other information to network nodes or edges enables hypothesis creation about sequence-structure-function relationships across sets of related proteins. Pythoscape provides several options to calculate pairwise similarities for input sequences or structures, applies filters to network edges, defines sets of similar nodes and their associated data as single nodes (termed representative nodes) to compress network information, and outputs data or formatted files for visualization.

Year: 2012 PMID: 22962345 PMCID: PMC3476340 DOI: 10.1093/bioinformatics/bts532

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

The rapid growth of databases of protein information (e.g. sequences and structures) provides both new opportunities and new challenges for analysis and clustering by similarity. For example, global analysis of entire superfamilies and association of their members with biological information and other types of metadata has become a useful tool for functional annotation and discovery (Brown and Babbitt, 2012). As these sets become larger (sometimes many thousands of sequences) and their members more divergent, fast large-scale exploration becomes less feasible using traditional approaches such as alignments and trees. Protein similarity networks (PSNs) enable analysis and visualization of structure–function relationships in large protein data sets by clustering individual protein sets for more complex analysis while summarizing ‘connectivity’ relationships among the clusters. Mapping orthogonal sources of biological information onto PSNs then provides a powerful way to view functional trends across the set that can be interpreted in the context of their similarities. (See Atkinson et al. for an initial analysis of some uses and statistical validation of PSNs.) While databases like Similarity Matrix of Proteins (SIMAP) (Rattei et al.) store pairwise similarities, and plug-ins available with software such as Cytoscape (Smoot et al.) allow creation of small PSNs (Wittkop et al.), no software solution exists to create and manage large PSNs. And while PSNs are inherently amenable to association with orthogonal information sources, the many information types available complicate development of a single software solution for managing such diverse features. Pythoscape addresses these issues and provides a software framework to create PSNs and develop new analyses for inference of functional properties in proteins.
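To make the idea of a PSN concrete, the following is a minimal sketch and not Pythoscape code. It assumes that all-by-all pairwise scores are already available as a dictionary of E-values (lower means more similar), keeps only edges at or below a chosen cutoff, and reports the resulting clusters as connected components. The function names, the cutoffs and the toy scores are illustrative assumptions.

```python
from collections import defaultdict


def build_network(pairwise_evalues, cutoff):
    """Return an adjacency map keeping only edges with E-value <= cutoff."""
    adjacency = defaultdict(set)
    for (a, b), evalue in pairwise_evalues.items():
        # Touch both keys so every protein is a node, even with no surviving edges.
        adjacency[a]
        adjacency[b]
        if evalue <= cutoff:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return dict(adjacency)


def clusters(adjacency):
    """Return connected components as a list of sorted lists of node names."""
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Toy all-by-all E-values for four hypothetical proteins.
scores = {
    ("P1", "P2"): 1e-40, ("P1", "P3"): 1e-5, ("P1", "P4"): 0.5,
    ("P2", "P3"): 1e-6, ("P2", "P4"): 0.8, ("P3", "P4"): 1e-3,
}
strict = clusters(build_network(scores, cutoff=1e-20))
permissive = clusters(build_network(scores, cutoff=1e-2))
print(strict)
print(permissive)
```

With the strict cutoff only the closest pair stays connected, whereas the permissive cutoff merges all four proteins into one cluster, which is why the choice of edge filter shapes what a PSN shows.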

2 DESCRIPTION AND SIGNIFICANCE

Pythoscape is an extensible computational framework implemented in Python to generate and analyze PSNs. For the user interested in generating large networks, the Pythoscape package has a core set of plug-ins (Supplementary Table S1) and tutorials, so that no development is needed to create simple networks painted with useful metadata. For software developers, Pythoscape provides a framework for rapid modification along with well-documented application programming interfaces for development of additional plug-ins using new sources of metadata. Unlike sparser networks such as interaction networks, PSNs are frequently close to complete, often requiring storage and management of large quantities of data, and fast calculation (Supplementary Table S2). Pythoscape allows for flexible storage of data through the use of storage interfaces. Appropriate storage solutions can be chosen based on network size or developed as needed, so that faster and more reliable database software can be adopted easily. Pythoscape can create, store and manage large networks and then, using representative nodes and edges to compress the information, output smaller summary networks for visualization (Fig. 1A and B). Users can choose how distances between representative nodes are calculated and, importantly, the full set of sequences in each node is retained for later use.
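The two ideas in this section, a swappable storage interface and compression into representative nodes, can be sketched as follows. This is an illustration under stated assumptions rather than the Pythoscape API: `EdgeStore`, `MemoryEdgeStore`, `representative_edges` and the `combine` parameter are hypothetical names, and the distances are toy values.

```python
from abc import ABC, abstractmethod
from itertools import combinations
from statistics import mean


class EdgeStore(ABC):
    """Hypothetical storage interface: a backend only has to implement these."""

    @abstractmethod
    def put(self, a, b, distance): ...

    @abstractmethod
    def get(self, a, b): ...


class MemoryEdgeStore(EdgeStore):
    """In-memory backend; a database-backed class could replace it for large networks."""

    def __init__(self):
        self._edges = {}

    def put(self, a, b, distance):
        # frozenset makes the edge undirected: (a, b) and (b, a) share one key.
        self._edges[frozenset((a, b))] = distance

    def get(self, a, b):
        return self._edges.get(frozenset((a, b)))


def representative_edges(store, groups, combine=mean):
    """Summarize distances between every pair of representative nodes.

    `groups` maps a representative node name to its member sequences; the
    members are kept, so the full set in each node remains available later.
    `combine` is the user-chosen rule for collapsing member-to-member distances.
    """
    summary = {}
    for (name_a, members_a), (name_b, members_b) in combinations(groups.items(), 2):
        distances = [store.get(a, b) for a in members_a for b in members_b]
        distances = [d for d in distances if d is not None]
        if distances:
            summary[(name_a, name_b)] = combine(distances)
    return summary


store = MemoryEdgeStore()
for a, b, d in [("s1", "s3", 4.0), ("s1", "s4", 6.0), ("s2", "s3", 8.0), ("s2", "s4", 10.0)]:
    store.put(a, b, d)
groups = {"rep1": ["s1", "s2"], "rep2": ["s3", "s4"]}
mean_summary = representative_edges(store, groups)
min_summary = representative_edges(store, groups, combine=min)
print(mean_summary, min_summary)
```

Because `representative_edges` only calls `put` and `get`, the same summary code would run unchanged against a different backend, and passing `combine=min` instead of the default mean changes how the distance between two representative nodes is defined.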

Fig. 1.

Sequence similarity network of the GST superfamily generated by Pythoscape and visualized in Cytoscape. To compact the view for this figure, networks were laid out using the organic layout in Cytoscape rather than the distances computed from a similarity metric. In all, 664 representative nodes are used to describe pairwise relationships among 7447 sequences. (A) Representative network with functional classes colored, if annotated by SwissProt in a family (The UniProt Consortium, 2011). Family membership is indicated if one or more sequences in the abstracted node are associated with that family. (B) Full non-abstracted network for the group of GSTs found mostly in eukaryotes (boxed in A).

3 EXAMPLE USAGE

Glutathione transferases (GSTs) are enzymes that typically catalyze the addition of glutathione to substrate compounds. They play roles in many biological processes, including metabolism of endogenous compounds and xenobiotics such as drugs. Of the thousands of GSTs that have been identified, the physiological substrates of only a small proportion are known; thus, they are principally classified into putative functional classes according to enzymatic, structural, and other features (Mannervik and Danielson, 1988). Recently, PSNs have been used to summarize and guide a global interpretation of GST sequence and structure relationships (Atkinson and Babbitt, 2009). A PSN of GST sequences is shown in Figure 1A (see supplementary information for network creation and graph statistics). It illustrates how representative nodes computed by Pythoscape enable analysis of PSNs too large to be visualized in total while retaining their value for developing hypotheses from sequence similarities across the whole set. For comparison, individual clusters of interest can be output with all nodes present (Fig. 1B). This full non-abstracted network (with one node for each sequence) shows a pattern of relationships similar to that shown in the corresponding representative node network (boxed in Fig. 1A). The correlation between the ideal representative node mean distances calculated in Pythoscape and the corresponding full network ideal distance for Fig. 1A is provided in Supplementary Figure S1. A quantitative description of the relationships between filtered networks and full networks has also recently been given elsewhere for some example systems (Atkinson et al.), but these differences also appear to depend on the specific system analyzed.
While ‘missing data’ is an inherent feature of representative nodes, the trade-off is the ability to visualize similarity relationships across large datasets, which would otherwise not be practically achievable because of memory and speed limitations in their calculation. The network shown in Figure 1A demonstrates another issue in the use of representative nodes that could complicate interpreting relationships between functional features and sequence similarity. In the example given here, some GST families are represented by multiple representative nodes, whereas other representative nodes contain multiple SwissProt families (HSP26, Phi and Tau), obscuring how sequence similarity tracks with annotation. Thus, we recommend that analysis using representative networks be accompanied by examination of the relevant parts of the corresponding full networks.
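One way to decide which parts of the full network to examine is to flag representative nodes that mix families and families that are split across nodes. The sketch below is a hypothetical helper, not part of Pythoscape; `audit_representatives`, the node names and the sequence-to-family assignments are toy assumptions that only reuse the family names mentioned above.

```python
def audit_representatives(groups, family_of):
    """Report how family annotation maps onto representative nodes.

    Returns (mixed_nodes, split_families):
      mixed_nodes    - nodes whose annotated members carry more than one family
      split_families - families whose members fall in more than one node
    Unannotated members (absent from `family_of`) are ignored.
    """
    mixed_nodes = {}
    nodes_per_family = {}
    for node, members in groups.items():
        families = {family_of[m] for m in members if m in family_of}
        if len(families) > 1:
            mixed_nodes[node] = sorted(families)
        for family in families:
            nodes_per_family.setdefault(family, []).append(node)
    split_families = {
        family: sorted(nodes)
        for family, nodes in nodes_per_family.items()
        if len(nodes) > 1
    }
    return mixed_nodes, split_families


# Toy data: sequence "d" has no annotation.
groups = {"rep1": ["a", "b"], "rep2": ["c", "d"], "rep3": ["e"]}
family_of = {"a": "Phi", "b": "Tau", "c": "HSP26", "e": "HSP26"}
mixed_nodes, split_families = audit_representatives(groups, family_of)
print(mixed_nodes)
print(split_families)
```

Nodes reported in `mixed_nodes` and families reported in `split_families` are the natural candidates for inspection in the corresponding full, non-abstracted network.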

4 CONCLUSION

Pythoscape is a software framework to efficiently create and manage protein similarity networks. Tutorials, Pythoscape documentation, source code and future development plans are available at http://www.rbvi.ucsf.edu/trac/Pythoscape.

REFERENCES

1. Inference of functional properties from large-scale analysis of enzyme superfamilies.

Authors: Shoshana D Brown; Patricia C Babbitt
Journal: J Biol Chem Date: 2011-11-08 Impact factor: 5.157

2. Comprehensive cluster analysis with Transitivity Clustering.

Authors: Tobias Wittkop; Dorothea Emig; Anke Truss; Mario Albrecht; Sebastian Böcker; Jan Baumbach
Journal: Nat Protoc Date: 2011-02-10 Impact factor: 13.491

3. Glutathione transferases--structure and catalytic activity.

Authors: B Mannervik; U H Danielson
Journal: CRC Crit Rev Biochem Date: 1988

4. Glutathione transferases are structural and functional outliers in the thioredoxin fold.

Authors: Holly J Atkinson; Patricia C Babbitt
Journal: Biochemistry Date: 2009-11-24 Impact factor: 3.162

5. Using sequence similarity networks for visualization of relationships across diverse protein superfamilies.

Authors: Holly J Atkinson; John H Morris; Thomas E Ferrin; Patricia C Babbitt
Journal: PLoS One Date: 2009-02-03 Impact factor: 3.240

6. Cytoscape 2.8: new features for data integration and network visualization.

Authors: Michael E Smoot; Keiichiro Ono; Johannes Ruscheinski; Peng-Liang Wang; Trey Ideker
Journal: Bioinformatics Date: 2010-12-12 Impact factor: 6.937

7. Reorganizing the protein space at the Universal Protein Resource (UniProt).

Authors: The UniProt Consortium
Journal: Nucleic Acids Res Date: 2011-11-18 Impact factor: 16.971

8. SIMAP--a comprehensive database of pre-calculated protein sequence similarities, domains, annotations and clusters.

Authors: Thomas Rattei; Patrick Tischler; Stefan Götz; Marc-André Jehl; Jonathan Hoser; Roland Arnold; Ana Conesa; Hans-Werner Mewes
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971
