George C Linderman1, Mark R Chance, Gurkan Bebek. 1. Department of Biomedical Engineering, Center for Proteomics and Bioinformatics, Case Western Reserve University, 10900 Euclid Ave, Cleveland, OH 44106, USA.
Abstract
MicroArray Gene expression and Network Evaluation Toolkit (MAGNET) is a web-based application that provides tools to generate and score both protein-protein interaction networks and coexpression networks. MAGNET integrates user-provided experimental measurements with high-throughput proteomic datasets, generating weighted gene-gene and protein-protein interaction networks. MAGNET allows users to weight edges of protein-protein interaction networks using a logistic regression model integrating tissue-specific gene expression data, sub-cellular localization data, co-clustering of interacting proteins and the number of observations of the interaction. This provides a way to quantitatively measure the plausibility of interactions in protein-protein interaction networks given protein/gene expression measurements. Secondly, MAGNET generates filtered coexpression networks, where genes are represented as nodes, and their correlations are represented with edges. Overall, MAGNET provides researchers with a new framework with which to analyze and generate gene-gene and protein-protein interaction networks, based on both the user's own data and publicly available -omics datasets. The freely available service and documentation can be accessed at http://gurkan.case.edu/software or http://magnet.case.edu.
MicroArray Gene expression and Network Evaluation Toolkit (MAGNET) is a web-based application that provides tools to generate and score both protein-protein interaction networks and coexpression networks. MAGNET integrates user-provided experimental measurements with high-throughput proteomic datasets, generating weighted gene-gene and protein-protein interaction networks. MAGNET allows users to weight edges of protein-protein interaction networks using a logistic regression model integrating tissue-specific gene expression data, sub-cellular localization data, co-clustering of interacting proteins and the number of observations of the interaction. This provides a way to quantitatively measure the plausibility of interactions in protein-protein interaction networks given protein/gene expression measurements. Secondly, MAGNET generates filtered coexpression networks, where genes are represented as nodes, and their correlations are represented with edges. Overall, MAGNET provides researchers with a new framework with which to analyze and generate gene-gene and protein-protein interaction networks, based on both the user's own data and publicly available -omics datasets. The freely available service and documentation can be accessed at http://gurkan.case.edu/software or http://magnet.case.edu.
A large amount of protein–protein interactions (PPIs) and gene expression data has become recently available from high-throughput techniques, such as yeast two-hybrid arrays, microarray gene expression arrays and whole transcriptome shotgun sequencing. Known PPIs are often collected into publicly available databases such as IntAct (1), BioGrid (2) and human protein reference database (HPRD) (3). It is natural to model both protein–protein and gene–gene interactions as a graph, where nodes correspond to genes/proteins, and edges correspond to interactions. Modeling protein/gene interactions as a network allows researchers to use a systems perspective in studying the relationships between different genes/proteins and allows for a host of new analysis techniques. These techniques include generating coexpression networks from mRNA gene expression data (4), PPI networks (5), gene regulatory networks (6), signaling pathways (7), probabilistic networks (8), and predicting reference networks by integrating datasets (9).Earlier, methods integrating heterogeneous types of high-throughput biological data were presented for gene function prediction (10), biological network discovery (11) and comparative interaction network analysis (12). Species-specific data mining and integration tools/portals also have been developed for Arabidopsis thaliana (13), Drosophila melanogaster (14), and Saccharomyces cerevisiae (15). However, the interactomes generated in recent years using high-throughput data have limited specificity, and the noisy and incomplete nature of the data undermines the results in many promising studies (16). We present an easy-to-use online toolbox, the MicroArray Gene expression and Network Evaluation Toolkit (MAGNET) that provides a solution to this problem. MAGNET integrates publicly available –omics data and user-provided gene/protein expression data into a logistic regression model to provide a weighted PPI, corresponding with the probability that it is a true interaction (17). In addition to weighting PPI networks, MAGNET can generate coexpression networks of user-defined sets of genes using corresponding mRNA expression data, where the associated weights correspond to the Pearson’s or Spearman’s Correlation Coefficients.MAGNET’s web interface was developed to accept individual experiments from the largest public repository for high-throughput gene expression data, Gene Expression Omnibus (GEO) (18). Users can download files from GEO and directly submit them to MAGNET. Although GEO is easily accessible, it is often cumbersome to read, filter and analyze these files as most exceed the capabilities of modern spreadsheet software. When weighting PPI networks, the user simply must specify the types of publicly available data to incorporate into the logistic regression model (localization, literature references, co-clustering), and the toolbox retrieves the data from its own database. This allows researchers to harness the power of a diverse group of public databases without having to deal with each of the different formats and standards (Figure 1). Overall, MAGNET provides value by easily processing and visualizing system-wide datasets to the end of generating or prioritizing interactions for further evaluation.
Figure 1.
MAGNET processes are shown. The user is asked to supply gene expression datasets and gene list(s) as necessary. The integrated databases are shown at top, and the final output for each process is shown in boxes below. The networks can be viewed as both tables and interactive graphs drawn with a web-based network viewer.
MAGNET processes are shown. The user is asked to supply gene expression datasets and gene list(s) as necessary. The integrated databases are shown at top, and the final output for each process is shown in boxes below. The networks can be viewed as both tables and interactive graphs drawn with a web-based network viewer.
MAGNET WEB SERVER
Scoring protein–protein interaction networks
MAGNET assigns weights to known PPIs by integrating sub-cellular localization data, co-clustering coefficient, number of literature observations and user-provided mRNA-level co-expression data (17). Each of these four variables, , is incorporated into a logistic regression model, where the probability of a true interaction between two proteins and given the variables is (17,19). Using a ‘golden’ dataset of experimentally verified interactions as a positive training set (20), and randomly selected (non-golden) interactions as negative training set (500 each), the model is trained to the specific experiment. By re-training MAGNET for every job that is submitted, MAGNET can find the optimal coefficients for each of the variables depending on its usefulness in determining the plausibility of the given interaction. MAGNET repeats the training step for a user-defined number of iterations and then takes an average over the resulting variable weights to determine the final set of constants (). For example, in some cases localization data were more useful and hence given a higher weight, whereas in others the localization data were not as beneficial. MAGNET uses this trained logistic model to score the PPI network, generating a weighted network. Therefore, each edge is associated with a probability showing how likely it is that the interaction exists, given the values of the four variables. It is important to note that the model is trained for every job that is submitted, which results in a model that is trained specifically to score interactions based on the user-provided microarray data.The first of the four variables, sub-cellular localization, is based upon the reasonable assumption that two interacting proteins are more likely to interact if they are co-localized to the same cellular component. The localization information is obtained from Gene Ontology cellular component annotations (21). A positive value (+1) is assigned if the proteins share at least one sub-compartment, whereas a negative value (−1) is assigned if they do not. While most of the proteins have this information, if there is no annotation found, they are scored with zero (0) to avoid unnecessarily penalizing these interactions. The second variable, the co-clustering coefficient, measures the connectedness of the neighbors of two given proteins, which has been shown to suggest a higher probability of interaction (22). The third variable measures the number of times that a given interaction has been reported across the PPI databases. The fourth variable is the correlation coefficient (Pearson’s or Spearman’s) between the expression values of the two genes corresponding to the given proteins. These correlations are calculated based on the user-provided expression data, whereas the former three variables are integrated into MAGNET. Expression data is the exception because expression of an interacting pair may vary greatly depending on the samples chosen.By integrating the first three variables from public databases with the fourth variable from tissue-specific gene/protein expression measurements, MAGNET effectively allows researchers to harness the power of these datasets while still obtaining results specific to their experiment. Suthram et al. (23) have evaluated various models used to assess the quality of interaction confidence assignment schemes. It was reported that a similar logistic regression model (without co-localization) performs better than others in correlating functional assignments of proteins. Hence, MAGNET utilizes a logistic regression model, since our focus is on assessing the validity of functional relationships of protein pairs in a given system.
Web Server
To submit a job to score a PPI network, the user must upload normalized expression data and the platform definition files (available from GEO) describing the genes targeted in that platform. The user can also select the variables used in the logistic regression model (Figure 2). After submitting the expression data, the user can filter the samples by the ‘sample_characteristic_ch’ fields in the GSE file (Step 2). This allows the user to include or exclude specific samples without having to manually edit the files. Finally, the user is presented with the console output and can then proceed to the Results page, where the resulting PPI network can be viewed as a network with a web-based network viewer, an edge list table, or downloaded as Cytoscape readable files for further analysis.
Figure 2.
Workflow for Module 1: Weighting protein–protein interaction networks. In step 1, the user specifies the organism taxonomy, adjusts the variables included in the logistic regression model, and uploads the gene expression data in GEO-compatible format. In step 2, the user can filter the samples based on the sample characteristics and annotation. After processing, the weighted PPI network is available in Cytoscape-compatible format, an interactive web-based network viewer for visualization and as a browser table with links to external source databases.
Workflow for Module 1: Weighting protein–protein interaction networks. In step 1, the user specifies the organism taxonomy, adjusts the variables included in the logistic regression model, and uploads the gene expression data in GEO-compatible format. In step 2, the user can filter the samples based on the sample characteristics and annotation. After processing, the weighted PPI network is available in Cytoscape-compatible format, an interactive web-based network viewer for visualization and as a browser table with links to external source databases.
Generating coexpression networks
Method
Analysis of coexpression relationships between genes/proteins provides insights into their interactions and functions. MAGNET can quickly and easily generate coexpression networks for a given set of genes where genes are represented by nodes, and edges connect genes whose coexpression correlation (Pearson’s or Spearman’s) is above a certain cut-off value (Figure 3). Although similar tools are available, they require specific platforms (e.g. R or Matlab) and do not provide easy access to visualization tools (4,24). This module works independently of the PPI module, and it generates networks that quantify the pair-wise correlation of the genes in a given network, i.e., high correlation values reflecting coexpression and negative correlations reflecting differential expression.
Figure 3.
Workflow for Module 2: Generation of coexpression networks. In step 1, the user uploads selected genes (or leaves it blank if interested in all genes in the array), specifies the coexpression cutoffs (if filtering is desired), specifies the type of correlation measure, and uploads the gene expression data in GEO-compatible format. Then, the user is taken to step 2, where the samples can be filtered by their annotation and sample characteristics. After processing, the output consists of tabulated results, a spreadsheet of correlation values, Cytoscape .sif and .eda files, and an interactive web-based network viewer for visualization.
Workflow for Module 2: Generation of coexpression networks. In step 1, the user uploads selected genes (or leaves it blank if interested in all genes in the array), specifies the coexpression cutoffs (if filtering is desired), specifies the type of correlation measure, and uploads the gene expression data in GEO-compatible format. Then, the user is taken to step 2, where the samples can be filtered by their annotation and sample characteristics. After processing, the output consists of tabulated results, a spreadsheet of correlation values, Cytoscape .sif and .eda files, and an interactive web-based network viewer for visualization.The form to generate a coexpression network is similar to that of the PPI network module, except that the user can also specify cut-offs to filter the edges in the resulting coexpression network. After job submission, the user is presented with the console output and can then proceed to a Results page similar to that of the PPI network module.
SOFTWARE DOCUMENTATION
MAGNET provides both an online manual and a full tutorial. In this manual, the workflow of MAGNET is explained step-by-step. Additional pictures and screenshots can guide the user who wants to understand the details and to tune the parameters available for MAGNET users. For testing purposes only, MAGNET provides an exemplary expression data file that can easily be used by selecting sample data during the first step of the wizard in each module. By doing so, users find a quick way to examine MAGNET’s features. The data provided are publicly available from GEO (GSE 19338 [19]) and represent gene expression profiling experiments run from villus and crypt layers of murine intestine. The series includes profiles from wild-type mice and mice that have mutations in adenomatous polyposis coli (APC) and p21 genes. These data were normalized using robust multi-array (RMA) normalization and uploaded with detailed annotations for future filtering steps. We use a gene set of 11 genes as an example to test the two processes, although MAGNET does not limit the number of genes in a given job.
CONCLUSION
MAGNET allows users to both score PPI networks by integrating four different diverse data types and to generate coexpression networks given expression profiles. All modules are developed for expression data formatted in the GEO SOFT format, but the site contains templates for non-GEO data as well. The site is optimized to work with large datasets with ease, preventing the user from having to deal with cumbersome arrays prior to analysis. The tool is freely available to researchers and can be accessed with any up-to-date web browser available.
FUNDING
National Institutes of Health (NIH) [P30-CA043703 to M.R.C. and UL1-, BB123456 to M.R.C.]. Funding for open access charge: NIH.Conflict of interest statement. None declared.
Authors: Ronald Jansen; Haiyuan Yu; Dov Greenbaum; Yuval Kluger; Nevan J Krogan; Sambath Chung; Andrew Emili; Michael Snyder; Jack F Greenblatt; Mark Gerstein Journal: Science Date: 2003-10-17 Impact factor: 47.728
Authors: Balaji S Srinivasan; Nigam H Shah; Jason A Flannick; Eduardo Abeliuk; Antal F Novak; Serafim Batzoglou Journal: Brief Bioinform Date: 2007-08-29 Impact factor: 11.622
Authors: Tanya Barrett; Dennis B Troup; Stephen E Wilhite; Pierre Ledoux; Dmitry Rudnev; Carlos Evangelista; Irene F Kim; Alexandra Soboleva; Maxim Tomashevsky; Kimberly A Marshall; Katherine H Phillippy; Patti M Sherman; Rolf N Muertter; Ron Edgar Journal: Nucleic Acids Res Date: 2008-10-21 Impact factor: 16.971
Authors: Andrey Alexeyenko; Thomas Schmitt; Andreas Tjärnberg; Dmitri Guala; Oliver Frings; Erik L L Sonnhammer Journal: Nucleic Acids Res Date: 2011-11-21 Impact factor: 16.971
Authors: J Michael Cherry; Eurie L Hong; Craig Amundsen; Rama Balakrishnan; Gail Binkley; Esther T Chan; Karen R Christie; Maria C Costanzo; Selina S Dwight; Stacia R Engel; Dianna G Fisk; Jodi E Hirschman; Benjamin C Hitz; Kalpana Karra; Cynthia J Krieger; Stuart R Miyasato; Rob S Nash; Julie Park; Marek S Skrzypek; Matt Simison; Shuai Weng; Edith D Wong Journal: Nucleic Acids Res Date: 2011-11-21 Impact factor: 16.971
Authors: Olga G Troyanskaya; Kara Dolinski; Art B Owen; Russ B Altman; David Botstein Journal: Proc Natl Acad Sci U S A Date: 2003-06-25 Impact factor: 12.779
Authors: Kanthida Kusonmano; Mari K Halle; Elisabeth Wik; Erling A Hoivik; Camilla Krakstad; Karen K Mauland; Ingvild L Tangen; Anna Berg; Henrica M J Werner; Jone Trovik; Anne M Øyan; Karl-Henning Kalland; Inge Jonassen; Helga B Salvesen; Kjell Petersen Journal: PLoS One Date: 2018-11-01 Impact factor: 3.240