ProViz-a web-based visualization tool to investigate the functional and evolutionary features of protein sequences.

Peter Jehl¹, Jean Manguy¹, Denis C Shields¹, Desmond G Higgins¹, Norman E Davey².

Abstract

Low-throughput experiments and high-throughput proteomic and genomic analyses have created enormous quantities of data that can be used to explore protein function and evolution. The ability to consolidate these data into an informative and intuitive format is vital to our capacity to comprehend these distinct but complementary sources of information. However, existing tools to visualize protein-related data are restricted by their presentation, sources of information, functionality or accessibility. We introduce ProViz, a powerful browser-based tool to aid biologists in building hypotheses and designing experiments by simplifying the analysis of functional and evolutionary features of proteins. Feature information is retrieved in an automated manner from resources describing protein modular architecture, post-translational modification, structure, sequence variation and experimental characterization of functional regions. These features are mapped to evolutionary information from precomputed multiple sequence alignments. Data are displayed in an interactive and information-rich yet intuitive visualization, accessible through a simple protein search interface. This allows users with limited bioinformatic skills to rapidly access data pertinent to their research. Visualizations can be further customized with user-defined data either manually or using a REST API. ProViz is available at http://proviz.ucd.ie/.

Year: 2016; PMID: 27085803; PMCID: PMC4987877; DOI: 10.1093/nar/gkw265

Source DB: PubMed; Journal: Nucleic Acids Res; ISSN: 0305-1048; Impact factor: 16.971

INTRODUCTION

Proteins are modular entities consisting of autonomous functional regions such as globular domains (1–3), disordered domains (4–6) and short linear motifs (SLiMs) (7–9). These modules are regularly modulated by post-translational modifications (PTMs) (10,11) and can be added or removed by alternative transcription or alternative splicing to produce protein isoforms with unique functional properties (12,13). Furthermore, single nucleotide polymorphisms (SNPs) can disrupt the normal function of these modules, often resulting in deleterious outcomes that underlie disease (14,15). Over the past decade, advances in biochemical, proteomic and genomic methods have rapidly expanded our understanding of these aspects of protein biology. Biochemical studies have revealed residues and regions of functional importance. Structural biology techniques have produced detailed structures of protein regions both in their unbound and bound state (16). Proteomic studies continue to expand the census of PTMs (10,11) and are now being applied to the difficult task of SLiM discovery (17,18). Genomic studies have catalogued both disease-causing and natural variant non-synonymous SNPs (14,15); and temporal, spatial or cell type-specific regions of proteins encoded by non-constitutive exons (13). Computational methods can accurately predict protein sequence features and attributes. Homology-based inference is widely used to map functional modules to regions of unstudied proteins (19). Sequence analysis tools can accurately define regions of proteins that are unlikely to have any structure in their native state (20,21) and predict protein membrane topology relative to membrane crossing regions (22). Finally, large-scale sequencing efforts have produced complete bacterial, archaeal, eukaryotic and viral proteomes allowing detailed investigation of protein sequence evolution, thereby pinpointing regions of functional constraint (13,23).
The accumulation of these data has expanded our understanding of many aspects of protein function. However, we are faced with enormous quantities of data dispersed across many different resources. Tools for the aggregation and visualization of a protein's modular architecture, experimental information and evolution are therefore central to our ability to consolidate and digest this information. Existing tools to visualize protein-related data are restricted by their accessibility, sparse functionality, limited sources of data and lack of user-friendly interfaces. Several powerful tools allowing complex alignment and feature manipulation that run locally on a user's machine are available (for example, JalView (24), CLC-Workbench and STRAP (25)). However, they require a knowledge of resources and methods to access pertinent data, and their extensive functionality far exceeds the requirements of many users. Recent developments driven by novel JavaScript libraries, HTML5 and CSS3 have expanded the potential capabilities of browser-based bioinformatics tools. For many computationally intensive and data-intensive bioinformatics problems, browser-based applications are still unsuitable. For certain tasks, however, these developments provide a powerful framework. In particular, the graphical ability and interactivity of such a framework are ideally suited to visualization tools. Developers must still be careful, however, as interfaces can become sluggish if large amounts of data are not handled carefully. To date, several browser-based protein data visualization tools of varying degrees of sophistication have been developed including Java applet based tools (JalView2 Lite (24), STRAP (25), PFAAT (26)), resource specific tools (PDB viewers (27,28), Pfam (3), webPrank (29)) and general alignment/feature viewers (Alignment-Annotator (30), MView (31), JSAV (32)) (Supplementary Table S1).
However, none of these tools truly leverage the available protein data resources and web frameworks to their fullest potential. To rectify this we have developed ProViz, a novel interactive browser-based visualization tool to investigate the functional and evolutionary features of protein sequences.

MATERIALS AND METHODS

Input options

ProViz provides a simple search interface built on the UniProt protein search engine (12). The search interface takes a gene or protein name and creates a table of results from which the protein of interest can be chosen. In cases where the protein of interest is not returned, a search can be refined by adding the species of interest or, if available, the UniProt identifier (e.g. SRC_HUMAN) or UniProt accession (e.g. P12931).
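
The distinction between the three kinds of search term can be illustrated with a short Python sketch. This is not the ProViz implementation: the function name and the returned labels are hypothetical, the accession pattern follows the regular expression published in the UniProt documentation, and the identifier pattern (mnemonic, underscore, species code) is a simplifying assumption.

```python
import re

# Accession pattern as given in the UniProt documentation.
ACCESSION_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed shape of a UniProt identifier: mnemonic, underscore, species code.
IDENTIFIER_RE = re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")


def classify_search_term(term: str) -> str:
    """Label a search term as 'accession', 'identifier' or 'name' (hypothetical labels)."""
    cleaned = term.strip().upper()
    if ACCESSION_RE.fullmatch(cleaned):
        return "accession"
    if IDENTIFIER_RE.fullmatch(cleaned):
        return "identifier"
    # Anything else is treated as a free-text gene or protein name.
    return "name"


if __name__ == "__main__":
    for example in ("P12931", "SRC_HUMAN", "src"):
        print(example, classify_search_term(example))
```

A term classified as a name would go through the search table described above, whereas an accession or identifier could in principle be resolved directly.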

Main visualization

The ProViz main visualization displays a protein of interest as a single linear peptide. This query protein is annotated with evolutionary and functional data (Figure 1). The data are mapped to each residue or a range of residues directly below the query sequence. The visualization consists of two sections: sequence data and feature data. The sequence data section displays the sequence of the query protein and, when available, a multiple sequence alignment of proteins homologous to the query protein. The feature data section displays data on the modular architecture, PTM state, structure, sequence variation and experimental characterization of functional regions for the query protein. An additional visualization, the protein architecture section, displays an overview of the query protein annotated with key features.

Figure 1.

Schema describing data retrieval, data processing and data display for the ProViz protein visualization tool. A user inputs a protein search term or user-defined sequence data (a protein sequence or protein multiple sequence alignment). Search terms (and, if possible, user-defined sequence data) are mapped to a UniProt accession. On the server-side, the UniProt accession is used to retrieve feature data from various resources and sequence-based prediction tools are applied to the protein sequences. All data are processed and returned to the browser-based ProViz front end for visualization. Blue boxes denote functions, red boxes denote external data sources, yellow boxes denote local data sources, grey boxes denote local bioinformatic tools and green boxes denote processed data.

Sequence data

The sequence data section displays protein and alignment data (Figure 2A). ProViz visualizations are built around a query sequence and, consequently, the query sequence is always displayed by default. Where possible, a multiple sequence alignment of proteins homologous to the query protein is displayed. To enable the mapping of features to the protein sequences, the alignment is degapped with respect to the query sequence. That is, columns of the alignment which are gaps in the query sequence are hidden. Residues flanking these hidden regions are displayed in lowercase. The sequence of a hidden region can be displayed by hovering over the flanking amino acids. A complete gapped alignment can be displayed using the options toolbar, but feature data are not shown in this view. Both the query protein sequence and alignments are coloured according to the rules of the ClustalX (33) colouring scheme to highlight conserved and physicochemically similar amino acids. ProViz accesses precomputed alignments automatically from two sources: orthologue alignments created by GOPHER (Generation of Orthologous Proteins from High-throughput Estimation of relationships) (34) (Supplementary Table S2) and GeneTree homologue alignments from EnsEMBL (13). GeneTree alignments can be filtered to display paralogue or orthologue alignments as well as complete homologue alignments. The available alignments for a query protein are displayed and can be selected in the options toolbar. Because of browser performance issues, large alignments are restricted to model organisms by default. However, individual aligned proteins or the whole alignment can be added back to the visualization by the user through the options sidebar.
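
The degapping step described above can be sketched as follows. This is a minimal illustration rather than the ProViz code: the function name, the dictionary layout and the choice to lowercase the flanking positions in every row (not only in rows that contain hidden residues) are assumptions.

```python
def degap_to_query(alignment: dict[str, str], query_id: str, gap: str = "-") -> dict[str, str]:
    """Hide alignment columns that are gaps in the query; lowercase flanking residues."""
    query = alignment[query_id]
    # Columns that hold a residue in the query are the only ones kept.
    keep = [i for i, ch in enumerate(query) if ch != gap]
    degapped = {}
    for name, seq in alignment.items():
        if len(seq) != len(query):
            raise ValueError(f"{name} is not the same length as the query row")
        out = []
        for pos, col in enumerate(keep):
            prev_col = keep[pos - 1] if pos > 0 else -1
            next_col = keep[pos + 1] if pos + 1 < len(keep) else len(query)
            # A hidden region borders this column when a neighbouring column was dropped.
            flanks_hidden = prev_col != col - 1 or next_col != col + 1
            residue = seq[col].upper()
            out.append(residue.lower() if flanks_hidden else residue)
        degapped[name] = "".join(out)
    return degapped


if __name__ == "__main__":
    example = {"query": "MK--LV", "orth1": "MKAALV", "orth2": "M-AAL-"}
    for name, row in degap_to_query(example, "query").items():
        print(name, row)
```

Because every kept column corresponds to exactly one query residue, features indexed by query position can be drawn directly beneath the degapped rows.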

Figure 2.

(A) ProViz visualization for Cyclin-dependent kinase inhibitor 1A (CDKN1A) showing selected features of CDKN1A and a GeneTree alignment of CDKN1A orthologues. Key aspects of the visualization are numbered: (1) Protein name and species; (2) options sidebar; (3) data information sidebar; (4) data select, hide and help buttons; (5) information hover tooltip; (6) options toolbar; (7) protein architecture overview; (8) protein sequence data; (9) protein feature data. The visualization in the example can be viewed at http://proviz.ucd.ie/proviz.php?uniprot_acc=P38936. (B) A zoomed view of a section of the visualization from panel A labelled with the types of data that are present in each section of the protein feature data. (C) Examples of the available track types.

Feature data

The query protein sequence is also annotated with protein feature data from numerous resources (Figure 2A and Supplementary Table S3) (3,9,10,12–15,18,19,22,35,36). The presented feature data describe diverse aspects of protein biology (Figure 2B). Complementary computed results from bioinformatics tools are also presented, including disorder predictions (20,21), binding site predictions (37), SLiM consensus matches (9) and residue relative conservation scores (38,39). Three different classes of feature data tracks are used (Figure 2C). Features mapping to a continuous segment of the query protein (for example, a domain or transmembrane region) are displayed as horizontal bars spanning the corresponding residues of the protein. Bars are also used to display single amino acid features, e.g. modification sites or SNPs. Peptide tracks are similar to bar tracks but display amino acids directly below the corresponding residue in the query protein. Peptide tracks are used for displaying the exact sequence of a region or amino acid of interest such as the tested residue or residues in a mutagenesis experiment. Histogram tracks display quantitative data for the protein on a residue by residue basis. Data are displayed as vertical blocks corresponding to the value given to the residue. Positive and negative values are possible and values are normalized to fit the track height.
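
One simple way to normalize signed per-residue values to a fixed track height is to scale by the largest absolute value, which preserves the sign of each block. The sketch below assumes this scheme; the text does not state which normalization ProViz applies, and the function name and `track_height` parameter are hypothetical.

```python
def normalize_histogram(values: list[float], track_height: float = 1.0) -> list[float]:
    """Scale values so the largest magnitude equals track_height (sign preserved)."""
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0:
        # An empty or all-zero track has nothing to scale.
        return [0.0 for _ in values]
    return [v / peak * track_height for v in values]


if __name__ == "__main__":
    # Hypothetical per-residue scores drawn on a track that is 10 units high.
    print(normalize_histogram([2.0, -4.0, 1.0], track_height=10.0))
```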

Protein architecture

The protein architecture section displays an overview of the query protein (Figure 2A). The section shows a compact visualization of the protein architecture showing key features of the protein: secondary structure, topology, globular domains, SLiMs and PTM sites. Users can move a slider or ctrl + click on a feature to rapidly navigate to the specific region of the main visualization. The overview also identifies the region of the query protein currently displayed in the main visualization.

Interactivity

The displayed ProViz data views are highly interactive and customizable. All sequences and tracks can be reordered or hidden. Alignment data can be restricted to proteins from a set of model organisms. Hidden data can be added back to the visualization using the options sidebar. When sequences are removed from an alignment, residues of the reduced alignment can be recoloured and columns solely consisting of gaps can be removed. Users can directly specify a range of residues, move a slider or ctrl + click on a feature to select a target area. The selected area can be used to resize or highlight the visualization. A condensed view is available that shrinks the visualization to double the viewable area. Most elements in the visualization have associated tooltips which upon hovering show detailed information about the element. All features open the web page of their source data upon clicking. Similarly, protein labels link to protein source data. A small tab next to the protein labels opens a new ProViz visualization with the selected protein as the new query sequence. ProViz alignments can be downloaded in FASTA format and the complete visualization can be downloaded in PDF format. Finally, a regular expression can be searched against the sequence data to highlight matching protein subsequences.
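
Two of the operations described above, removing columns that consist solely of gaps after sequences are hidden and searching a regular expression against a sequence, can be sketched in a few lines of Python. The function names are hypothetical and the code illustrates the operations only; it is not taken from ProViz.

```python
import re


def drop_gap_only_columns(rows: list[str], gap: str = "-") -> list[str]:
    """Remove alignment columns in which every remaining row has a gap."""
    if not rows:
        return []
    keep = [i for i in range(len(rows[0])) if any(row[i] != gap for row in rows)]
    return ["".join(row[i] for i in keep) for row in rows]


def find_motif(sequence: str, pattern: str) -> list[tuple[int, int, str]]:
    """Return (start, end, match) using 1-based, inclusive residue positions."""
    # re.finditer reports non-overlapping matches, scanning left to right.
    return [(m.start() + 1, m.end(), m.group()) for m in re.finditer(pattern, sequence)]


if __name__ == "__main__":
    print(drop_gap_only_columns(["A-C-", "A-CD"]))
    print(find_motif("MSPAPSPLK", "[ST]P"))
```

Reporting 1-based inclusive positions keeps the matches in the same coordinate system as residue numbering in the query protein.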

Advanced options and customization

ProViz's advanced visualization customization allows users to integrate ProViz into biological resources or present external data. An extensive list of URL options can produce customized visualizations. For example, a specific alignment, set of features or range of amino acids can be displayed (see Supplementary Table S4 for more details). Custom proteins or alignments can be submitted in FASTA format via the homepage. As ProViz relies on the UniProt accession to retrieve features for the query protein, an MD5 hash value is calculated for the query protein and is searched against the UniParc database. For alignments, the first protein of the alignment is taken as the query protein. If the protein's UniProt identifier is found, feature data are loaded as normal. If it cannot be found, no retrieved features are shown; however, results from sequence-dependent predictive tools and user-provided custom information are still displayed. ProViz also offers the option to add custom tracks to the main visualization. Bar, histogram and peptide tracks are all available. Custom tracks are supplied as a file in ‘XML’, ‘CSV’ or ‘JSON’ format, either by drag and drop, by file upload, or by a URL pointing to a REST service or a pre-computed file in ProViz format. Example files for custom feature input in ‘XML’, ‘CSV’ and ‘JSON’ formats are available on the ProViz website and are described in the Supplementary Data (Supplementary Tables S5 and S6 and Supplementary File 7).
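
As an illustration of the two mechanisms above, the following Python sketch computes an MD5 checksum for a protein sequence and builds a ProViz link from a UniProt accession. The function names are hypothetical, and stripping whitespace and uppercasing the sequence before hashing is an assumption about how the sequence should be normalized. Only the `uniprot_acc` URL parameter, which appears in the example link in the Figure 2 legend, is used; the remaining URL options are listed in Supplementary Table S4 and are not reproduced here.

```python
import hashlib
from urllib.parse import urlencode

# Base URL taken from the example link in the Figure 2 legend.
PROVIZ_URL = "http://proviz.ucd.ie/proviz.php"


def sequence_md5(sequence: str) -> str:
    """MD5 hex digest of a sequence after removing whitespace and uppercasing (assumed)."""
    cleaned = "".join(sequence.split()).upper()
    return hashlib.md5(cleaned.encode("ascii")).hexdigest()


def proviz_link(uniprot_acc: str) -> str:
    """Build a link that opens the visualization for one UniProt accession."""
    return f"{PROVIZ_URL}?{urlencode({'uniprot_acc': uniprot_acc})}"


if __name__ == "__main__":
    print(sequence_md5("MKLV"))
    print(proviz_link("P38936"))
```

Hashing the normalized sequence means that the same protein pasted with different line wrapping or letter case yields the same checksum for the lookup.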

DISCUSSION

We have introduced ProViz, a novel browser-based interactive exploration tool to investigate the functional and evolutionary features of protein sequences. The browser-based interface of ProViz provides numerous advantages, especially relating to ease of accessibility. However, there are also drawbacks. The most obvious issue is browser performance, which can result in a lag when displaying or interacting with visualizations containing large amounts of data. Big alignments and large proteins are a particular problem and may cause significant difficulties on older browsers and low specification machines. Similarly, much of the complex functionality that can be performed instantly in local protein data visualization tools, such as recolouring alignments upon sequence removal, requires server-side recalculation of the visualization data. As such, ProViz should be considered a protein exploration tool that includes protein multiple sequence alignments rather than a fully functional protein multiple sequence alignment viewer. Nevertheless, ProViz provides a unique resource to quickly gain insight into the function and evolution of proteins. ProViz is a versatile tool that can aid biologists in many ways, for example, building hypotheses, designing experiments or understanding experimental results. The flexible customization options of ProViz also permit bioinformaticians to extensively customize the visualization, for example, by mapping data from high-throughput proteomic studies or adding the results of per residue predictive bioinformatic tools to the presented data. The current version of the tool provides a solid foundation on which to build in the future. Planned extensions include incorporating additional relevant sources of data, sequence-based prediction tools and alignment colouring options. We believe ProViz is an invaluable time-saving resource that will become an integral part of the day to day work of a biologist.

39 in total

1. The Protein Data Bank.

Authors: H M Berman; J Westbrook; Z Feng; G Gilliland; T N Bhat; H Weissig; I N Shindyalov; P E Bourne
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes.

Authors: A Krogh; B Larsson; G von Heijne; E L Sonnhammer
Journal: J Mol Biol Date: 2001-01-19 Impact factor: 5.469

Review 3. Structure, function and evolution of multidomain proteins.

Authors: Christine Vogel; Matthew Bashton; Nicola D Kerrison; Cyrus Chothia; Sarah A Teichmann
Journal: Curr Opin Struct Biol Date: 2004-04 Impact factor: 6.809

4. SnapShot: Intrinsic Structural Disorder.

Authors: Mainak Guharoy; Kris Pauwels; Peter Tompa
Journal: Cell Date: 2015-05-21 Impact factor: 41.582

5. The switches.ELM resource: a compendium of conditional regulatory interaction interfaces.

Authors: Kim Van Roey; Holger Dinkel; Robert J Weatheritt; Toby J Gibson; Norman E Davey
Journal: Sci Signal Date: 2013-04-02 Impact factor: 8.192

Review 6. A million peptide motifs for the molecular biologist.

Authors: Peter Tompa; Norman E Davey; Toby J Gibson; M Madan Babu
Journal: Mol Cell Date: 2014-07-17 Impact factor: 17.970

7. Jalview Version 2--a multiple sequence alignment editor and analysis workbench.

Authors: Andrew M Waterhouse; James B Procter; David M A Martin; Michèle Clamp; Geoffrey J Barton
Journal: Bioinformatics Date: 2009-01-16 Impact factor: 6.937

8. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Authors: Fabian Sievers; Andreas Wilm; David Dineen; Toby J Gibson; Kevin Karplus; Weizhong Li; Rodrigo Lopez; Hamish McWilliam; Michael Remmert; Johannes Söding; Julie D Thompson; Desmond G Higgins
Journal: Mol Syst Biol Date: 2011-10-11 Impact factor: 11.429

9. Ensembl 2016.

Authors: Andrew Yates; Wasiu Akanni; M Ridwan Amode; Daniel Barrell; Konstantinos Billis; Denise Carvalho-Silva; Carla Cummins; Peter Clapham; Stephen Fitzgerald; Laurent Gil; Carlos García Girón; Leo Gordon; Thibaut Hourlier; Sarah E Hunt; Sophie H Janacek; Nathan Johnson; Thomas Juettemann; Stephen Keenan; Ilias Lavidas; Fergal J Martin; Thomas Maurel; William McLaren; Daniel N Murphy; Rishi Nag; Michael Nuhn; Anne Parker; Mateus Patricio; Miguel Pignatelli; Matthew Rahtz; Harpreet Singh Riat; Daniel Sheppard; Kieron Taylor; Anja Thormann; Alessandro Vullo; Steven P Wilder; Amonida Zadissa; Ewan Birney; Jennifer Harrow; Matthieu Muffato; Emily Perry; Magali Ruffier; Giulietta Spudich; Stephen J Trevanion; Fiona Cunningham; Bronwen L Aken; Daniel R Zerbino; Paul Flicek
Journal: Nucleic Acids Res Date: 2015-12-19 Impact factor: 16.971

10. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962
