MassNet: a functional annotation service for protein mass spectrometry data.

Daeui Park, Byoung-Chul Kim, Seong-Woong Cho, Seong-Jin Park, Jong-Soon Choi, Seung Il Kim, Jong Bhak, Sunghoon Lee.

Abstract

Although mass spectrometry is frequently used to identify proteins, no web server provides comprehensive functional annotation of the identified proteins. Such a service is needed because the volume of mass spectrometry data is increasing rapidly. We therefore introduce MassNet, which provides (i) physico-chemical analysis information, (ii) KEGG pathway assignment, (iii) Gene Ontology mapping and (iv) protein-protein interaction (PPI) prediction for data from MASCOT, Prospector and Profound. MassNet predicts PPIs using both 3D structural interactions and experimental interactions deposited in PSIMAP, BIND, DIP, HPRD, IntAct, MINT, CYGD and BioGrid. The web service is freely available at http://massnet.kr or http://sequenceome.kobic.re.kr/MassNet/.

Year: 2008 PMID: 18448467 PMCID: PMC2447811 DOI: 10.1093/nar/gkn241

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Mass spectrometry (MS) is the key method for proteomics (1). MS is widely used to study complex cellular proteomes and low-abundance proteins (1–3). It allows rapid protein identification and yields information on protein complexes and post-translational modifications (3). MS data are also used to produce genome-scale data (4). At present, functional annotation of MS data often requires researchers to navigate numerous web-accessible primary data servers. For large-scale data, one approach is to provide access to an integrated web server that contains rich biological information with graphical interfaces (5). Several MS data processing systems have been developed to handle these challenges, namely MASCOT (http://www.matrixscience.com) (2), Prospector (http://prospector.ucsf.edu) and Profound (http://prowl.rockefeller.edu) (6). These systems identify proteins using public databases such as SwissProt (http://www.ebi.ac.uk/swissprot) and NCBInr (http://www.ncbi.nlm.nih.gov). However, these web services do not include functional annotation of MS data and do not supply the latest versions of the analysis tools. To provide an easy, automated pipeline for functional annotation of MS results, we constructed a web-based server, MassNet. MassNet requires no application installation and is easy to use.

METHODS

To analyze MS data, various protein annotation resources are required. Therefore, we integrated major protein sequence databases, protein–protein interaction (PPI) databases, Gene Ontology (GO) (http://www.geneontology.org) (7), KEGG pathway (http://www.genome.jp/kegg) (8) and bioinformatics analysis tools such as SignalP. This system has four major parts: (i) a nonredundant protein database, (ii) a physico-chemical property analysis module, (iii) a function annotation module and (iv) a PPI prediction module. A schematic workflow of MassNet is shown in Figure 1.

Figure 1.

The schematic workflow of MassNet.

Construction of the nonredundant protein database

To identify proteins from MS data, researchers use various protein sequence databases such as NCBInr, SwissProt and trEMBL. Because the same protein can appear under different identifiers in different databases, identifiers are easily confused. To resolve this, all protein identifiers were relationally linked. We integrated the protein sequence databases (SwissProt, trEMBL, NCBInr, RefSeq, Ensembl and IPI), merging only entries with perfectly matching sequences. The database unifies the protein IDs that share a sequence and summarizes annotations and descriptions of proteins from a range of organisms covering eukaryotes, prokaryotes and viruses. Each root identifier (Sequenceome_ID) can therefore map to several protein identifiers from all available databases. The Sequenceome_ID database is a nonredundant sequence database of 6 856 434 proteins (April 2008).
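The perfect-match merging step can be illustrated with a minimal Python sketch. It is not the MassNet code: the function name, the `SEQ0000001`-style root identifier format and the example records are all hypothetical, and the only idea taken from the text is that entries are merged when their sequences are identical.

```python
from collections import defaultdict


def build_nonredundant_index(records):
    """Group (database, identifier, sequence) records by exact sequence.

    Returns a dict mapping a root identifier to the shared sequence and
    the list of source identifiers that carry that sequence.
    """
    by_sequence = defaultdict(list)
    for database, identifier, sequence in records:
        # Normalise case and whitespace so that only real residue
        # differences keep two entries apart.
        key = "".join(sequence.split()).upper()
        by_sequence[key].append((database, identifier))

    index = {}
    for n, (sequence, ids) in enumerate(sorted(by_sequence.items()), start=1):
        root_id = f"SEQ{n:07d}"  # hypothetical root identifier format
        index[root_id] = {"sequence": sequence, "ids": sorted(ids)}
    return index


# Illustrative records only; the identifiers and sequences are made up.
records = [
    ("SwissProt", "P12345", "MKTAYIAK"),
    ("RefSeq", "NP_000001", "mktayiak"),
    ("IPI", "IPI00000001", "MKTAYIAKQ"),
]
index = build_nonredundant_index(records)
for root_id, entry in index.items():
    print(root_id, entry["ids"])
```

Keying on the full sequence string keeps the sketch simple; a database of several million proteins would more plausibly key on a sequence checksum, but that is a design choice the text does not describe.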

Analysis of physico-chemical properties of proteins

The physico-chemical properties of proteins identified by MS are important for understanding their biological functions. In particular, predicting hydropathy and subcellular localization helps to find membrane proteins, which are involved in cellular processes and include protein classes used as drug targets (9). We used modules from Biopython (http://biopython.org) (10) to calculate hydropathy profile, GRAVY score (the average hydropathy score for all the amino acids), protein length, molecular weight, amino acid distribution, isoelectric point and protein instability index (11). For subcellular localization, we predicted transmembrane helices and signal peptides using Phobius (http://phobius.sbc.su.se) (12) and SignalP 3.0 (http://www.cbs.dtu.dk/services/SignalP) (13). To provide physico-chemical information without delay, we precalculated the properties for all nonredundant protein sequences. The physico-chemical properties of whole-protein sets are also provided as summary tables or figures. When a set of proteins is submitted, the user can compare the set's physico-chemical distribution with the whole-protein distribution of the organism. For example, if the identified protein set came from the membrane fraction of an organism, the user can compare the relative transmembrane protein abundance between the organism's whole-protein set and the identified set. This summary information can therefore be used to evaluate the quality of the input data.
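As an illustration of the GRAVY score and hydropathy profile, the following self-contained Python sketch uses the Kyte-Doolittle hydropathy scale. It is not the MassNet implementation, which relies on Biopython modules; the function names and the window size of 9 are assumptions made for the example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Average hydropathy over all residues of the sequence."""
    seq = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy in each sliding window along the sequence.

    The window size is an assumption; the text does not state one.
    """
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


print(gravy("IK"))
print(hydropathy_profile("AAAAAAAAAA"))
```

In practice, Biopython's `Bio.SeqUtils.ProtParam.ProteinAnalysis` class offers `gravy()`, `molecular_weight()`, `isoelectric_point()` and `instability_index()` methods for the same properties; the pure-Python version above only makes the arithmetic explicit.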

Integration of annotation information

MassNet provides biological function information using KEGG pathways and GO. The KEGG pathway database and GO represent attempts to assign known proteins to known biological pathways and functional categories, and both are updated regularly (8). MassNet assigns proteins to KEGG pathways through ID mapping and shows color-coded proteins in the context of biochemical pathway maps using the KEGG API. To find significant associations of GO terms with the queried proteins, we assigned proteins to GO categories and GO-slim (14) through ID mapping. To assess the statistical significance of the KEGG and GO assignments, we added Fisher's exact test, which reports a P-value.
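A minimal sketch of how such a P-value can be computed is given below. It assumes a one-sided Fisher's exact test for over-representation, which is equivalent to the upper tail of the hypergeometric distribution; the text does not state whether MassNet uses a one- or two-sided test or which background set it uses, so the function name and arguments are illustrative only.

```python
from math import comb


def enrichment_p_value(hits_in_query, query_size,
                       hits_in_background, background_size):
    """One-sided Fisher's exact test for over-representation.

    Probability of drawing at least `hits_in_query` annotated proteins
    when `query_size` proteins are sampled without replacement from a
    background of `background_size` proteins, `hits_in_background` of
    which carry the annotation (for example a GO term or KEGG pathway).
    """
    max_hits = min(query_size, hits_in_background)
    total = comb(background_size, query_size)
    tail = sum(
        comb(hits_in_background, k)
        * comb(background_size - hits_in_background, query_size - k)
        for k in range(hits_in_query, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 2 of 2 query proteins carry a term that
# 2 of 4 background proteins carry.
print(enrichment_p_value(2, 2, 2, 4))
```

Because one test is run for every pathway or GO category, the resulting P-values would normally be corrected for multiple testing; the text does not say whether MassNet applies such a correction.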

Prediction of PPI

The prediction of PPI is based on PSIMAP (protein structural interactome MAP) (http://psimap.com, http://psibase.kobic.re.kr) (15,16) and PEIMAP (protein experimental interactome MAP) (17). The basic algorithm of PSIMAP infers interactions among proteins by using their homologs. Interactions among domains or proteins in known PDB (Protein Data Bank) (http://www.rcsb.org/pdb) structures are the basis of the predictions. If an unknown protein is homologous to a domain, PSIMAP assumes that the query tends to interact with its homolog's partners. This concept is called ‘homologous interaction’ (18–20). The original interaction between two proteins or domains is based on the Euclidean distance between them in the structure. PSIMAP therefore gives a structure-based interaction prediction (15). PEIMAP, on the other hand, is a well-established method that uses public resources of experimentally confirmed protein interaction information such as BIND (http://bond.unleashedinformatics.com) (21), DIP (http://dip.doe-mbi.ucla.edu) (22), IntAct (http://www.ebi.ac.uk/intact) (23), MINT (http://mint.bio.uniroma2.it/mint) (24), HPRD (http://www.hprd.org) (25), CYGD (http://mips.gsf.de/genre/proj/yeast) (26) and BioGrid (http://www.thebiogrid.org) (27). We constructed a nonredundant PPI database from these source databases, using Perl (http://www.perl.org) to carry out a redundancy check that removes identical protein sequences. The database currently contains 116 773 proteins and 229 799 interactions. The accuracy of PEIMAP depends on the confidence of each resource. To reduce the false positive rate of PEIMAP, we computed a final ‘combined score’ for each pair of proteins predicted by the PEIMAP and PSIMAP algorithms. Similar scoring schemes have been described in published work, including the STRING server (http://string.embl.de) (28). Users can predict PPIs for a list of queried proteins and examine them with a network viewer.
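The exact formula for the combined score is not given here. As a hedged illustration only, the sketch below combines per-source confidence scores under the assumption that the sources are independent, in the style commonly associated with STRING: the combined score is one minus the product of the individual probabilities of being wrong. The function name and the example scores are hypothetical.

```python
def combined_score(scores):
    """Combine per-source confidence scores in [0, 1].

    Assumes the sources are independent, so the chance that every
    source is wrong is the product of (1 - score). This is an
    illustrative scheme, not the formula confirmed for MassNet.
    """
    remaining = 1.0
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        remaining *= 1.0 - score
    return 1.0 - remaining


# Hypothetical example: a pair supported by a structure-based
# prediction (0.5) and by one experimental resource (0.5).
print(combined_score([0.5, 0.5]))  # prints 0.75
```

Under this scheme, a pair supported by several moderately reliable sources can outscore a pair supported by a single source, which is the behaviour that helps to suppress false positives from any one resource.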

USER INTERFACE

Input

The query interface allows the user to submit either an HTML result file from an MS search engine or a TAB-delimited text file. The TAB-delimited file must contain protein names in the first column; its format is described in detail on the ‘HOW TO USE’ page. MassNet accepts four input formats: MASCOT, Prospector, Profound and TAB-delimited files.
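A TAB-delimited query of this kind could be read as in the following Python sketch. Only the rule that protein names occupy the first column comes from the text; the score in the second column, the removal of duplicate names and the function name are assumptions made for the example.

```python
import csv
import io


def read_protein_names(handle):
    """Return unique, non-empty first-column values in input order."""
    seen = set()
    names = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        name = row[0].strip()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Hypothetical file content: protein name, then a score column.
example = io.StringIO("P12345\t87\nQ67890\t45\n\nP12345\t60\n")
protein_names = read_protein_names(example)
print(protein_names)  # prints ['P12345', 'Q67890']
```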

Output

After uploading the query file, users can obtain the annotation information as in Figure 2a. The annotation results consist of five parts: (i) a protein list page, (ii) the physico-chemical property of each protein, (iii) a PPI prediction page, (iv) a KEGG pathway page and (v) a GO page.

Figure 2.

Screenshots of MassNet annotation results. (a) The panel in the middle is the protein list table. The KEGG Pathway tab shows the KEGG pathway assignment and metabolic pathway graph (right panels). The Gene Ontology tab shows proteins assigned to GO categories (left top panel). The Chemical Statistics tab shows the input protein set's physico-chemical distribution against the whole-protein distribution of the organism (left bottom panels). (b) Protein-protein interactions of user-selected proteins are visualized by a network viewer. Rectangles are protein nodes, and the black connecting lines indicate interactions among the nodes. The two red rectangular nodes are proteins selected by the user through the right-hand panel. When the user selects proteins from the pull-down menus in the right panel, the corresponding nodes are highlighted on the drawing canvas on the left.

The protein list page shows a table of protein names and scores parsed from the query file. The KEGG pathway and GO pages show the number of proteins that belong to each KEGG pathway and GO category. By clicking the ‘Run PPI Prediction’ button at the top of the protein list table, the user can obtain PPI information for selected proteins. The PPI page shows PEIMAP and PSIMAP (see Methods section) data in two separate tables. By clicking a Sequenceome_ID on any page, users can access two pages: a Same IDs page and a Chemical Property page. The Same IDs page lists the identical sequences in the various protein sequence databases and provides hyperlinks to the original database web pages. To present interactions clearly, MassNet provides a viewer for PPI networks, as shown in Figure 2b.

IMPLEMENTATION

The MassNet web server runs on a Linux server. It combines a MySQL (http://www.mysql.com) database with a dynamic web interface using Java Server Pages (http://java.sun.com/products/jsp). Data preprocessing is implemented in Perl and Python, and the network viewer for PPI was constructed using Java.

CONCLUSION

The functional analysis and interpretation of large-scale MS data remain a challenging task. An automatic approach is necessary for the tens of thousands of MS data sets collected throughout the world. MassNet is the first web server that provides various kinds of functional information, such as physico-chemical properties, biological pathways, gene ontology and PPI, for MS data. MassNet is easy to use and annotates queried proteins automatically.

27 in total

1. KEGG: kyoto encyclopedia of genes and genomes.

Authors: M Kanehisa; S Goto
Journal: Nucleic Acids Res Date: 2000-01-01 Impact factor: 16.971

2. Protein interaction verification and functional annotation by integrated analysis of genome-scale data.

Authors: Patrick Kemmeren; Nynke L van Berkum; Jaak Vilo; Theo Bijma; Rogier Donders; Alvis Brazma; Frank C P Holstege
Journal: Mol Cell Date: 2002-05 Impact factor: 17.970

3. Protein interactions: two methods for assessment of the reliability of high throughput observations.

Authors: Charlotte M Deane; Łukasz Salwiński; Ioannis Xenarios; David Eisenberg
Journal: Mol Cell Proteomics Date: 2002-05 Impact factor: 5.911

4. A combined transmembrane topology and signal peptide prediction method.

Authors: Lukas Käll; Anders Krogh; Erik L L Sonnhammer
Journal: J Mol Biol Date: 2004-05-14 Impact factor: 5.469

5. PSIbase: a database of Protein Structural Interactome map (PSIMAP).

Authors: Sungsam Gong; Giseok Yoon; Insoo Jang; Dan Bolser; Panos Dafas; Michael Schroeder; Hansol Choi; Yoobok Cho; Kyungsook Han; Sunghoon Lee; Hwanho Choi; Michael Lappe; Liisa Holm; Sangsoo Kim; Donghoon Oh; Jonghwa Bhak
Journal: Bioinformatics Date: 2005-03-03 Impact factor: 6.937

6. Locating proteins in the cell using TargetP, SignalP and related tools.

7. MINT: a Molecular INTeraction database.

8. A simple method for displaying the hydropathic character of a protein.

9. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology.

10. BioGRID: a general repository for interaction datasets.