CANVS: an easy-to-use application for the analysis and visualization of mass spectrometry-based protein-protein interaction/association data.

Erick F Velasquez¹, Yenni A Garcia¹, Ivan Ramirez¹, Ankur A Gholkar¹, Jorge Z Torres¹,²,³.

Abstract

The elucidation of a protein's interaction/association network is important for defining its biological function. Mass spectrometry-based proteomic approaches have emerged as powerful tools for identifying protein-protein interactions (PPIs) and protein-protein associations (PPAs). However, interactome/association experiments are difficult to interpret, considering the complexity and abundance of data that are generated. Although tools have been developed to identify protein interactions/associations quantitatively, there is still a pressing need for easy-to-use tools that allow users to contextualize their results. To address this, we developed CANVS, a computational pipeline that cleans, analyzes, and visualizes mass spectrometry-based interactome/association data. CANVS is wrapped as an interactive Shiny dashboard with simple requirements, allowing users to interface easily with the pipeline, analyze complex experimental data, and create PPI/A networks. The application integrates systems biology databases such as BioGRID and CORUM to contextualize the results. Furthermore, CANVS features a Gene Ontology tool that allows users to identify relevant GO terms in their results and create visual networks with proteins associated with relevant GO terms. Overall, CANVS is an easy-to-use application that benefits all researchers, especially those who lack an established bioinformatic pipeline and are interested in studying interactome/association data.

Substances:
Proteins

Year: 2021 PMID: 34432510 PMCID: PMC8693966 DOI: 10.1091/mbc.E21-05-0257

Source DB: PubMed Journal: Mol Biol Cell ISSN: 1059-1524 Impact factor: 4.138

INTRODUCTION

Direct protein–protein interactions (PPIs) and indirect protein–protein associations (PPAs) are critical to understanding the biological function of a protein of interest (POI). PPI/As can inform on how a POI is compartmentalized within a cell, how it forms higher-order complexes, how it is regulated, and how it coordinates with other proteins in a spatial and temporal manner to carry out specific cellular processes (Yugandhar ; Lu ). More broadly, PPI/A networks have been used to analyze the composition of cellular structures such as centrosomes, kinetochores, cilia, and other organelles and to define the function of cell cycle proteins (Torres ; Firat-Karalar and Stearns, 2015; Cheung ; Go ; Remnant ; Garcia ; Guo ). Several approaches have been used to identify PPIs, including yeast two-hybrid (Fields and Song, 1989; Chien ), fluorescence resonance energy transfer (Selvin, 1995), and protein fragment complementation assay (Michnick ). Although these classical methods are important for identifying and validating PPIs, mass spectrometry–based approaches have made high-throughput identification of PPIs possible (Yugandhar ). Affinity purification mass spectrometry (AP-MS) has become the conventional method of identifying PPIs, since it isolates protein complexes from cell lysates under near-physiological conditions (Gingras ). More recently, the field has transitioned to defining PPAs, which may represent direct protein–protein interactions or local protein neighborhoods, through proximity-dependent biotinylation methods (Perkins ). A popular method is BioID, where a POI is tagged with a promiscuous biotin ligase, which biotinylates proteins in close proximity to the POI in the presence of biotin (Sears ). 
Overall, popular pipelines for PPI/A approaches involve identifying a POI, tagging the POI with an appropriate protein tag, expression of the tagged-POI, biochemical purifications, MS of purifications to identify proteins, and qualitative/quantitative bioinformatic analyses of identified proteins (Sears ; Yugandhar ). Mass spectrometry-based approaches to identify PPI/As generate large amounts of PPI/A data, which are difficult to analyze and interpret. To overcome these issues, computational tools have been developed to analyze PPI/A data quantitatively (Choi , 2012; Nesvizhskii, 2012; Teo ), to create visual representations of the analyses (Knight ), and to generate PPI/A networks (Shannon ; Szklarczyk ). One pipeline, APOSTL, integrates these steps and automates the process within a Galaxy framework (Kuenzi ). However, currently APOSTL does not allow users to filter search results by Gene Ontology–based (GO) terms or integrate protein-complex data. Furthermore, most PPI/A data analysis computational tools focus on the accuracy of protein identification, instead of on how the identified proteins might be associated at the molecular level. With this in mind, we sought to develop a computational pipeline that allowed users with no computational background to explore PPI/A data interactively within the context of relevant biological processes, molecular functions, cellular components, and protein–protein complex interactions. Here, we present CANVS (Clean Analyze Network Visualization Software), an open access computational pipeline that cleans mass-spectrometry PPI/A data through statistical analyses and annotates identified proteins with proteoinformatic databases such as BioGRID (Biological General Repository for Interaction Datasets; Oughtred ) and CORUM (Comprehensive Resource of Mammalian Protein Complexes) (Giurgiu ) to create protein interaction/association networks. 
Furthermore, CANVS allows users to apply GO (Gene Ontology) (Ashburner ; Consortium GO, 2021) filters to create protein networks relevant to biological processes, cellular locations, or molecular functions of interest. To ensure accessibility to all researchers, CANVS is wrapped in a Golem framework (Fay ) and is deployed as a Shiny dashboard app that can be downloaded and installed locally on a Windows system (https://sourceforge.net/projects/canvs/files/). CANVS can be used as a standalone tool; however, the user can also upload results from other proteomic pipelines to generate protein networks quickly and identify proteins with relevant biological associations to a POI. Overall, CANVS provides an easy-to-use interactive framework where proteoinformatic resources are integrated to quantify, contextualize, and visualize data from PPI/A experiments that enable users to better understand the biological role of their POI.

RESULTS AND DISCUSSION

Features

CANVS is an open-access, easy-to-use pipeline for studying protein–protein interactome/association data. It was created so that researchers with no computational background can quickly analyze mass spectrometry data from affinity-based and proximity-based protein purifications, with an emphasis on identifying biologically interesting PPI/As that can be further validated and explored. Briefly, the CANVS pipeline (Figure 1) can be divided into four steps: 1) uploading data, 2) cleaning data to identify significant protein hits, 3) analyzing results by applying proteoinformatic databases, and 4) visualizing the resulting PPI/A networks. The CANVS pipeline takes advantage of user interface web development packages in R and is wrapped as a Shiny dashboard app that can be installed locally on a Windows system.

FIGURE 1:

CANVS workflow. Mass spectrometry data files, comma-delimited text files, with protein UniProt accession numbers, protein descriptions, protein quantitative values (scores), and bait POIs are uploaded and rendered as interactive data tables. To clean the data, users can determine the significance of the identified proteins, given proper controls, using log 2–fold change and a Student’s t test. Significant protein identifications are then analyzed by applying Gene Ontology (GO) terms, the Comprehensive Resource of Mammalian Protein Complexes (CORUM) database, and the Biological General Repository for Interaction Datasets (BioGRID) database. The visNetwork R package is then used to visualize the GO PPI/A, CORUM PPI/A, and BIOGRID PPI/A networks.

Step 1. Data upload/preprocessing.

CANVS accepts csv or tab-delimited text files with five columns containing information on: protein UniProt (Apweiler ) accession number, score or quantitative value, protein description, file name, and protein bait name (Supplemental Material—Uploading Data). Files are then merged, and the merged data table can be accessed in the Upload tab (Supplemental Figure S1). Additionally, the user has the option of uploading one data table each, for the control and experimental, that has all the information about replicates/conditions. The data tables are interactive and the user can search for specific keywords in the search section. Preprocessing involves both determining how many replicates a protein needs to appear in to carry out the analysis and normalization across replicates and baits (Figure 2 and Supplemental Figures S2–S4). High-throughput MS-based approaches to identify PPI/As contain systematic biases due to steps in data processing and generation (Chawade ). To overcome these biases, the field has adopted normalization methods that aim to make samples more comparable across replicates/conditions (Chawade ; Välikangas ). CANVS gives the user the option to normalize by the median and scales samples so that each purification has the same median value (Välikangas ). We recommend using this method if no previous normalization was performed and if the user suspects variation across purifications due to human error (sample preparation) or analytical errors (device calibration, temperature fluctuations, etc.). Both settings are predefined by the user in the Clean tab, where the user can filter by the number of replicates a protein must be identified in to be considered for further analysis and whether to normalize by the median (Figure 2). Please see the Supplemental Material for details and step-by-step user instructions.
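The two preprocessing settings described above can be sketched in a few lines of Python. This is a minimal illustration, not the CANVS implementation (which is written in R): the function names, the toy accessions, and the choice of the mean of the per-purification medians as the common target are all assumptions made for the example.

```python
from statistics import median

def median_normalize(purifications):
    """Scale each purification so that all of them share the same median score.

    purifications maps a purification name to {accession: score}.
    Assumption: the common target is the mean of the per-purification medians.
    """
    medians = {name: median(scores.values()) for name, scores in purifications.items()}
    target = sum(medians.values()) / len(medians)
    return {
        name: {acc: value * target / medians[name] for acc, value in scores.items()}
        for name, scores in purifications.items()
    }

def filter_by_replicates(purifications, min_replicates):
    """Return the accessions identified in at least min_replicates purifications."""
    counts = {}
    for scores in purifications.values():
        for acc in scores:
            counts[acc] = counts.get(acc, 0) + 1
    return {acc for acc, n in counts.items() if n >= min_replicates}

# Toy replicates with hypothetical accessions P1-P3.
replicates = {
    "rep1": {"P1": 1.0, "P2": 2.0, "P3": 3.0},
    "rep2": {"P1": 2.0, "P2": 4.0, "P3": 6.0},
    "rep3": {"P1": 3.0, "P2": 6.0},
}
normalized = median_normalize(replicates)
kept = filter_by_replicates(replicates, min_replicates=3)
```

After normalization every replicate has the same median, and only the proteins seen in all three replicates remain candidates for the statistical step.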

FIGURE 2:

CANVS cleaning method. CANVS allows users to upload interaction/association MS data, filter by a minimum number of replicates a protein should be present in, normalize proteins by the median value of each purification, and apply significance statistics. CANVS calculates the log 2–fold change and p values that can be visualized in an interactive volcano plot. The user can then filter by a certain p value or fold change and the results are used in the pipeline for further analysis/visualization.

Step 2. Semi–quantitative/qualitative analysis of protein hits.

After setting up the preprocessing options, users can determine whether to perform a statistical analysis to identify significant protein hits (Supplemental Material: Cleaning Data). CANVS uses spectral counts as a quantitative representation of protein abundance (Lundgren ), specifically the exponentially modified protein abundance index (emPAI), since this score considers the number of peptides per protein, an important metric in IP-MS (Ishihama ). Additionally, other label-free or labeled quantitative values can be used, including results from intensity-based quantification, stable isotope labeling using amino acids in cell culture (SILAC), and tandem mass tag (TMT) experiments (Thompson ; Ong and Mann, 2006; Zhang ). However, the data must be in an array format and representative of the abundance of a protein. If a protein is absent from a control or an experimental purification (a missing value), CANVS assigns it half the minimum value of the nonmissing proteins in the same purification (Wei ). CANVS offers two methods for filtering data: using the log 2–fold change of proteins compared with a control, or using a combination of log 2–fold change and significance statistics in the form of a two-tailed Student’s t test (Student, 1908; Hubner ). Calculating the difference in the logarithmic mean protein intensities between experimental and control purifications allows users to identify nonspecific associations that center around zero (Hubner ; Singh ). If a sufficient number of replicates (we recommend three biological replicates and two technical replicates) are used in the analysis, the user has the option of calculating the significance in the log 2–fold change using a two-tailed Student’s t test. Comparing the negative log of the p value with the log 2–fold change creates a volcano plot where background proteins cluster at zero (Singh ). 
The interactive volcano plot is displayed in the results box of the Clean tab (Figure 2 and Supplemental Figures S2–S4). The user then has the option of analyzing the data by log 2–fold change only or log 2–fold change and significance statistics. This is done by changing the drop-down menu under Analysis Method (Supplemental Material: Cleaning Data). The user can filter by a specific fold change of interest or a different p value; however, the preset options are set at a log 2–fold change of 0.6 and a p value of 0.05. Additionally, if the user is concerned about the multiple testing problem and does not wish to use a strict p-value cutoff, an option is available to adjust the p value via the Benjamini and Hochberg method (Benjamini ). Briefly, adjusting the p value controls the false discovery rate and therefore corrects for the expected number of false positives among all positives that rejected the null hypothesis (Jafari and Ansari-Pour, 2019). Results appear in the form of a data table in the Clean tab, and are referenced throughout the rest of the pipeline. Alternatively, users can elect not to carry out a statistical analysis and use CANVS solely to contextualize and visualize the results. To do so, users can select the no analysis option in the Clean tab and the program will treat the experimental data as the results. Even when no statistical analysis is performed, the user must press the run button in the Clean tab to let CANVS know that the data uploaded in the experimental should be used as the results. This feature allows CANVS to be integrated easily with any other statistical pipeline of choice. The experimental data will then appear in the form of a data table in the Clean tab, and this data table will be referenced in the rest of the pipeline (Figure 2 and Supplemental Figures S2–S4).
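To make the effect of the Benjamini and Hochberg adjustment concrete, the sketch below adjusts a small set of p values and applies the preset cutoffs (log 2–fold change of 0.6, p value of 0.05). It is an illustration in Python rather than the R code used by CANVS. The protein labels and values are invented, and filtering with `>=` and `<=` on one-sided enrichment is an assumption about how the cutoffs are applied.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy hits: (label, log2 fold change, raw p value). All values are made up.
hits = [("A", 1.5, 0.01), ("B", 0.9, 0.04), ("C", 0.2, 0.03), ("D", 2.0, 0.20)]
adjusted = benjamini_hochberg([p for _, _, p in hits])

LOG2FC_CUTOFF = 0.6  # preset named in the text
P_CUTOFF = 0.05      # preset named in the text
significant = [
    name
    for (name, lfc, _), q in zip(hits, adjusted)
    if lfc >= LOG2FC_CUTOFF and q <= P_CUTOFF
]
```

With raw p values both A and B would pass; after adjustment B's value rises above 0.05 and only A remains, which is the stricter behavior the adjustment is meant to provide.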

Step 3. Data analysis.

CANVS features two tabs to analyze and generate visual network representations of the results (Supplemental Material—Analysis/Visualization). The first Analyze/Visualize tab (Supplemental Figures S5–S7) creates networks for all identified proteins. The GO Analyze/Visualize tab (Supplemental Figures S8 and S9) features a GO-based (Ashburner ; Consortium TGO, 2019) filtering tool that allows users to search the protein results for associations with GO terms of interest. The GO database classifies GO terms into three categories: biological processes (BP), molecular function (MF), and cellular component (CC). CANVS links proteins in the results to associated GO terms. Users can search by keyword and CANVS searches for GO terms that have that keyword. GO terms with the keyword and their respective subterms are selected. CANVS then renders two objects: a color-coded network representing how the selected GO terms are related and a data table with all the selected GO terms (Figure 3). Users can then select GO terms of interest in the data table, which instructs CANVS to filter the results for proteins that have the selected GO terms. Alternatively, if users want to filter the results with all of the GO terms in the table, no selection is necessary and the network visuals will reflect proteins associated with all of the GO terms. This feature is particularly helpful if users are interested in specific biological processes, molecular functions, and/or cellular components and want to search the results for proteins associated with such GO terms.
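The keyword search with subterm retrieval can be pictured as a small graph traversal. The following Python sketch assumes a toy ontology with placeholder term IDs (T1 to T4) and placeholder proteins; the data structures and function names are hypothetical and do not reflect how CANVS stores the GO database internally.

```python
def select_go_terms(terms, children, keyword):
    """Return IDs of terms whose name contains the keyword, plus all of their subterms."""
    keyword = keyword.lower()
    selected = set()
    stack = [tid for tid, name in terms.items() if keyword in name.lower()]
    while stack:
        tid = stack.pop()
        if tid in selected:
            continue
        selected.add(tid)
        stack.extend(children.get(tid, []))  # descend into subterms
    return selected

def filter_proteins(protein_terms, selected):
    """Keep proteins annotated with at least one selected term."""
    return sorted(p for p, tids in protein_terms.items() if selected & set(tids))

# Toy ontology: placeholder IDs, not real GO accessions.
terms = {
    "T1": "spindle checkpoint",
    "T2": "mitotic spindle checkpoint",
    "T3": "DNA repair",
    "T4": "metaphase arrest",
}
children = {"T1": ["T2", "T4"]}
protein_terms = {"ProtA": ["T2"], "ProtB": ["T4"], "ProtC": ["T3"]}

selected = select_go_terms(terms, children, "Spindle")
network_proteins = filter_proteins(protein_terms, selected)
```

Note that T4 is selected even though its name lacks the keyword, because it is a subterm of a matching term; ProtB is therefore kept alongside ProtA.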

FIGURE 3:

Selection of Gene Ontology (GO) terms and filtering results with selected GO terms. Users can perform a keyword search, and GO terms containing the keyword(s) of interest and any associated subterms are retrieved. Only GO terms associated with protein hits in the dataset will appear and can be selected and applied as a filter. Proteins with the associated GO terms of interest are included in the network tables.

Once the user determines whether to include all proteins in the results or selects certain proteins based on GO terms, proteins in the result can be further annotated and contextualized. CANVS integrates two main databases: the Biological General Repository for Interaction Datasets (BioGRID v. 4.3; Oughtred ) and the Comprehensive Resource of Mammalian Protein Complexes (CORUM v. 3.0; Giurgiu ). BioGRID contextualizes results in terms of previously identified protein–protein interactions, whereas CORUM contextualizes the results in terms of known protein–protein complex information. For both databases, the user first defines an organism of interest and annotations are performed within the context of that organism. See the Supplemental Material for a full list of organisms supported by CANVS, BioGRID, and CORUM. By analyzing the results from both databases, the user can identify known PPIs and complex-PPI/As present in the results.
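One way to think about complex-based annotation is as a lookup that links hit proteins sharing membership in a known complex. The Python sketch below shows that idea with invented gene and complex names; the dictionary layout is an assumption for illustration and is not the CORUM file format or the CANVS code.

```python
from itertools import combinations

def complex_edges(hits, complexes):
    """Link every pair of hit proteins that belong to the same known complex.

    hits is a list of gene names; complexes maps a complex name to its members.
    Returns sorted (protein_a, protein_b, complex_name) tuples.
    """
    hit_set = set(hits)
    edges = set()
    for name, members in complexes.items():
        present = sorted(hit_set & set(members))
        for a, b in combinations(present, 2):
            edges.add((a, b, name))
    return sorted(edges)

# Toy data: hypothetical gene and complex names.
hits = ["GeneA", "GeneB", "GeneC", "GeneD"]
complexes = {
    "Complex 1": ["GeneA", "GeneB", "GeneX"],  # GeneX was not identified
    "Complex 2": ["GeneC"],                    # only one member among the hits
}
edges = complex_edges(hits, complexes)
```

Only Complex 1 contributes an edge here, since it is the only complex with two or more members among the hits.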

Step 4. Data visualization.

After the results have been annotated (using CORUM or BioGRID) or selected (using associated GO terms), three types of networks are created: protein–protein network (Figure 4A), CORUM protein–protein complex network (Figure 4B and Supplemental Figure S10), and a BioGRID protein–protein network (Figure 4C and Supplemental Figure S11). Visual representations are powered by visNetwork, an R package that provides a framework for creating visual networks in an interactive environment (Almende, 2021). Considering a basic network, the user can interact with the graphs by zooming in on certain sections of a network, selecting certain proteins to highlight interactions, and downloading edited networks as png files. Networks can also be reset and created again, a helpful feature when considering different parameters that might influence which proteins are rendered in the networks. The user also has the option to download Cytoscape network files (Shannon ), which can be used in Cytoscape to recreate networks developed using CANVS. See the Supplemental Material for a detailed walkthrough on how to use CANVS network visuals.

FIGURE 4:

Creation of interactive PPI/A networks of (A) protein hits associated with the selected GO terms integrating (B) CORUM protein complex information and (C) BioGRID PPI information.

Conclusions

CANVS is an interactive tool that allows scientists to integrate systems biology databases and create PPI/A networks with biologically relevant results. The simple requirements of the application, along with its interactive networks, make CANVS a powerful tool to be used in conjunction with other tools or as a standalone pipeline. CANVS is particularly useful to researchers studying a POI or sets of proteins and wanting to contextualize the results of their interactome/association experiments. The integration of BioGRID allows the user to compare the results to a wide set of interactome experiments and create PPI networks based on previous PPI data. Similarly, the integration of CORUM allows users to identify protein complexes within their results, providing context as to how a POI or bait might be related to other proteins in the network. Additionally, by creating networks where specific molecular functions, biological processes, or cellular components are prioritized, the user can quickly parse through the results and concentrate on PPI/As that are relevant to their scientific question. Ultimately, additional databases and statistical methods can be integrated into the pipeline. It is important to note that protein interactions and associations should be validated biochemically. As examples, we recently used the analytical framework included in CANVS to study the PPI/A networks of DUSP7, which helped to define its regulation of ERK2 during mitosis (Guo ), and to analyze the PPA networks of core spindle assembly checkpoint proteins (Garcia ). Overall, CANVS offers an interactive and easy-to-use solution to study PPI/A data that can be used by laboratories without an established proteomic analysis pipeline.

MATERIALS AND METHODS

Testing CANVS with mass spectrometry data from BioID experiments

To test CANVS, we utilized a previously published mass spectrometry data set from BioID-based experiments of core spindle assembly checkpoint proteins (Garcia ). Briefly, BioID2-tagged inducible HeLa stable cell lines were generated for core spindle assembly checkpoint (SAC) proteins (BUB1, BUB3, BUBR1, MAD1L1, MAD2L1). These cell lines were induced to express the BioID2-tagged core SAC proteins and incubated with biotin, and BioID purifications were performed in triplicate for each bait except for BUB3, which was performed in duplicate. A BioID2-tag alone cell line was used as control. The purifications were then analyzed by LC-MS/MS and peptide identification was conducted with Mascot (v2.4; Matrix Science, Boston, MA) against the UniProt human database (October 10, 2018). Search parameters included trypsin digestion allowing up to two missed cleavages, carbamidomethyl on cysteine as a fixed modification, oxidation of methionine as a variable modification, 10-ppm peptide mass tolerance, and 0.02-Da fragment mass tolerance. Peptides that surpassed a cut-off score of 20, assuming a 5% false discovery rate, were accepted. From the SAC protein BioID panel, 387 proteins were identified with at least two significant peptides and were further processed using CANVS. To perform a statistical analysis of proteins identified in both the control and experimental, emPAI scores (Ishihama ) were considered as quantifiable values. Files compatible with CANVS were then generated by summarizing search results by UniProt accession ID, protein description, quantifiable (emPAI score), file name, and associated bait. This dataset is available at the SourceForge directory (https://sourceforge.net/projects/canvs/files/), where users can download it for reference.
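As a rough guide to what such a summary file might look like, the Python sketch below writes search results into a five-column comma-delimited table. The header names, the file name, and the example record are hypothetical; the column labels CANVS actually expects are given in the Supplemental Material and should be used instead.

```python
import csv
import io

# Hypothetical header names for the five fields described in the text.
COLUMNS = ["accession", "score", "description", "file", "bait"]

def to_rows(search_results, file_name, bait):
    """Turn (accession, description, score) tuples into five-column records."""
    return [
        {"accession": acc, "score": score, "description": desc,
         "file": file_name, "bait": bait}
        for acc, desc, score in search_results
    ]

def write_csv(rows):
    """Serialize records as comma-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Made-up record: one protein from one replicate of a BUB1 purification.
results = [("ACC001", "hypothetical protein 1", 0.42)]
text = write_csv(to_rows(results, file_name="bub1_rep1", bait="BUB1"))
```

One file per purification, each tagged with its file name and bait, is enough for the replicate counting and control comparison described in the statistical analysis.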

Statistical analysis

Given quantitative scores, the significance of proteins shared between the experimental and the control can be determined using log 2–fold change values and significance statistics (Hubner ; Singh ). Proteins are first filtered by the number of replicates each protein is present in, a number chosen by the user. Proteins with the appropriate replicate count are further tested for significance. If a protein is not present in either a control or experimental purification, half of the minimum value of that purification is used (Wei ). Fold change is calculated by comparing the mean value of a protein across experimental purifications to the mean value of the same protein across control purifications. CANVS then calculates the log base 2 of the fold change, since it is beneficial to represent the distribution around zero, a value that indicates no change for a protein between the experimental and control. If a sufficient number of replicates (we suggest at least three biological replicates and two technical replicates) are used in the analysis, the user has the option of calculating the significance in the log 2–fold change using a two-tailed Student’s t test. Additionally, the user can adjust the p value using the Benjamini and Hochberg method (Benjamini ).
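The calculation described above can be followed step by step in a short Python sketch. It is illustrative only: the function names and toy scores are invented, and applying the t statistic to the untransformed scores with a pooled variance is an assumption, since the text does not state which values enter the test. The two-tailed p value would then be read from a t distribution with the returned degrees of freedom.

```python
from math import log2, sqrt
from statistics import mean, variance

def impute_half_min(purification, accessions):
    """Fill proteins absent from one purification with half its minimum observed score."""
    fill = min(purification.values()) / 2
    return {acc: purification.get(acc, fill) for acc in accessions}

def log2_fold_change(experimental, control):
    """log2 of the ratio of mean experimental score to mean control score."""
    return log2(mean(experimental) / mean(control))

def student_t(experimental, control):
    """Two-sample Student's t statistic (pooled variance) and its degrees of freedom."""
    n1, n2 = len(experimental), len(control)
    df = n1 + n2 - 2
    pooled = ((n1 - 1) * variance(experimental) + (n2 - 1) * variance(control)) / df
    t = (mean(experimental) - mean(control)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df

# Toy scores for a hypothetical protein P1; it is missing from the third control.
control_reps = [{"P1": 0.5, "P2": 2.0}, {"P1": 1.0, "P2": 2.0}, {"P2": 3.0}]
experimental_p1 = [3.0, 4.0, 5.0]

control_p1 = [impute_half_min(rep, ["P1", "P2"])["P1"] for rep in control_reps]
lfc = log2_fold_change(experimental_p1, control_p1)
t_stat, df = student_t(experimental_p1, control_p1)
```

In this example the missing control value is filled with 1.5 (half of that purification's minimum of 3.0), and the fourfold enrichment gives a log 2–fold change of 2.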

Integration of systems biology databases

Three main systems biology databases are integrated in CANVS: the Biological General Repository for Interaction Datasets (BioGRID v. 3.5; Oughtred ), the Comprehensive Resource of Mammalian Protein Complexes (CORUM v. 3.0; Giurgiu ), and Gene Ontology (Ashburner ; Consortium TGO, 2019). To incorporate each database, the entire database was downloaded and transformed into an R object that is then referenced in a function. In the case of Gene Ontology, both GO terms and sub-terms were merged by GO term ID. To identify data that correspond to certain proteins across all three databases, UniProt accession numbers are translated to common gene names using the R package org.Hs.eg.db (Carlson, 2019). The package includes local versions of each database. Updates to databases in the package will be performed every six months.

Development of Shiny dashboard

CANVS follows a Golem framework, with the motivation of creating an easy-to-use application with simplified modules (Fay ). Modules, functions, and helper functions were created using the framework outlined in the Golem package. To make the app interactive yet intuitive, a dashboard framework was implemented where the user is able to perform a part of the pipeline in each tab. Visual networks are created using visNetwork, an R package designed to create interactive network visuals (Almende, 2021). Briefly, protein results are formatted to include node information, edge information, and annotations for each node including ID and visual components that can be changed by the user (color, shape, etc.). The networks are then rendered in an interactive Shiny session in the format of a Shiny dashboard. The app was deployed using protocols from DesktopDeployR, a framework for deploying self-contained R-based applications in Windows (https://github.com/wleepang/DesktopDeployR). For step-by-step instructions on how to use CANVS, refer to the Supplemental Material or the main repository.
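The node and edge formatting step can be illustrated with a bait-centered toy network. The sketch is in Python for consistency with the other examples, whereas CANVS builds these tables in R. The field names (`id`, `label`, `color`, `shape`, `from`, `to`) mirror the columns visNetwork reads from its nodes and edges tables; the function name, colors, and shapes are arbitrary choices made for the example.

```python
def build_network(bait, hits, bait_color="tomato", hit_color="steelblue"):
    """Format a bait and its hits as node and edge records.

    Every hit is connected to the bait, giving a simple star-shaped network.
    """
    nodes = [{"id": bait, "label": bait, "color": bait_color, "shape": "box"}]
    edges = []
    for protein in hits:
        if protein == bait:
            continue  # the bait is already a node; skip self-edges
        nodes.append(
            {"id": protein, "label": protein, "color": hit_color, "shape": "ellipse"}
        )
        edges.append({"from": bait, "to": protein})
    return nodes, edges

# Toy input: the bait appearing among its own hits is ignored.
nodes, edges = build_network("BUB1", ["BUB3", "MAD2L1", "BUB1"])
```

Keeping the visual attributes in the node records is what lets a user change color or shape per node without touching the edge table.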

Installation

CANVS is packaged as an R Shiny dashboard app, but since it includes data from large databases, the application is deployed locally. Currently the local version of the app can only be used on a Windows system. To install CANVS, download the CANVS.zip file from the main repository (https://sourceforge.net/projects/canvs/files/). Unzip the file in a location where you want to save the application. Open the folder and right click on the CANVS.bat file; then select create a shortcut. Name the shortcut CANVS and then drag the shortcut to the desktop. Then double click on the new shortcut and CANVS will open in a web browser window. Please make sure that no antivirus software is blocking application connections to ports on the computer and that the firewall does not block the use of ports.

42 in total

1. Detection of protein-protein interactions by protein fragment complementation strategies.

Authors: S W Michnick; I Remy; F X Campbell-Valois; A Vallée-Bélisle; J N Pelletier
Journal: Methods Enzymol Date: 2000 Impact factor: 1.600

2. Controlling the false discovery rate in behavior genetics research.

Authors: Y Benjamini; D Drai; G Elmer; N Kafkafi; I Golani
Journal: Behav Brain Res Date: 2001-11-01 Impact factor: 3.332

3. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

4. The STARD9/Kif16a kinesin associates with mitotic microtubules and regulates spindle pole assembly.

Authors: Jorge Z Torres; Matthew K Summers; David Peterson; Matthew J Brauer; James Lee; Silvia Senese; Ankur A Gholkar; Yu-Chen Lo; Xingye Lei; Kenneth Jung; David C Anderson; David P Davis; Lisa Belmont; Peter K Jackson
Journal: Cell Date: 2011-12-09 Impact factor: 41.582