
Web resources for mass spectrometry-based proteomics.

Tao Chen¹, Jie Zhao², Jie Ma¹, Yunping Zhu³.

Abstract

With the development of high-resolution and high-throughput mass spectrometry (MS) technology, a large quantity of proteomic data is continually being generated. Collecting and sharing these data is a challenge that requires immense and sustained human effort. In this report, we provide a classification of important web resources for MS-based proteomics and present a rating of these web resources, based on whether raw data are stored, whether data submission is supported, and whether data analysis pipelines are provided. These web resources are important for biologists involved in proteomics research.


Keywords: Mass spectrometry; Proteomics; Web resources


Year: 2015 PMID: 25721607 PMCID: PMC4411487 DOI: 10.1016/j.gpb.2015.01.004

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691

Introduction

The advancement of tandem mass spectrometry (MS) has made it possible to identify hundreds of thousands of proteins in MS-based experiments [1]. With the development of a wide range of methods for spectrometry and data analysis, MS-based proteomics has gained popularity in biomedical research. The rapidly expanding research using tandem MS technology is continually generating large amounts of proteomics data, and collecting these datasets is becoming crucial to the research community. A proteomics data repository holds a proteome with high coverage and sufficient data content for statistical analysis, and also provides extensive observational data for genome annotation projects. However, maintaining such a repository is challenging because of the diversity and volume of the data as well as the varying needs of different users. In this report, we describe web data repositories for MS-based proteomics and rate them on three parameters: storage of raw data, data submission support, and provision of data analysis pipelines. The main features of these resources are shown in Table 1. Based on their focus areas within proteomic research, we classified these resources into three categories: general proteomics data repositories, quantitative proteomics data repositories, and proteomics data repositories focusing on protein post-translational modifications (PTMs).

Table 1

List of major MS-based proteomics resources

Category	Name	Link	Main features	Rating	Refs.
General	PRIDE	http://www.ebi.ac.uk/pride/archive	Supports raw data storage and data submission	★★★★★	[3]
	PeptideAtlas	http://www.peptideatlas.org	Supports raw data storage, data submission, and data analysis	★★★★★	[1]
	Human Proteinpedia	http://www.humanproteinpedia.org	Supports raw data storage and data submission	★★★★☆	[4]
	iProX	http://iprox.hupo.org.cn	Supports raw data storage, data submission, and data analysis	★★★★★
	Tranche	https://proteomecommons.org/tranche	Supports raw data storage and data submission	★★★☆☆	[5]
	GPMDB	http://www.thegpm.org	Supports data analysis	★★★☆☆	[6]
	MOPED	http://moped.proteinspire.org	Stores protein expression information from MS-based proteomics experiments	★★★☆☆	[7]
	YPED	http://yped.med.yale.edu	An integrated bioinformatics suite and database for proteomics research	★★★★☆	[8], [9]
Quantitative	PaxDb	http://pax-db.org	Supports quantitative proteomics data storage	★★★☆☆	[10]
PTMs-focused	Phospho.ELM	http://phospho.elm.eu.org	Supports phosphoproteomic MS data storage	★★★☆☆	[11]
	PhosphoSitePlus	http://www.phosphosite.org	Stores raw data and MS-reported PTM sites	★★★★☆	[12]
	dbPTM	http://dbptm.mbc.nctu.edu.tw	Stores raw data and MS/MS peptides associated with PTMs	★★★★☆	[13], [14]
	PHOSIDA	http://www.phosida.com	Supports raw data storage and phosphoproteomic MS data storage	★★★★☆	[15], [16]

Note: These web resources are rated based on their score against parameters such as storage of raw data, data submission support, and provision of data analysis pipelines. MS, mass spectrometry; PTM, post-translational modification.
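As a rough illustration of how Table 1 can be queried, the sketch below transcribes its rows into Python tuples and looks for the three rating parameters as phrases in the "Main features" column. The names `TABLE_1`, `CRITERIA`, `criteria_mentioned`, and `resources_mentioning` are hypothetical helpers introduced here, and the phrase matching is an assumption made for illustration; it does not reproduce how the star ratings were assigned.

```python
# Rows transcribed from Table 1: (category, name, main features, stars).
TABLE_1 = [
    ("General", "PRIDE", "Supports raw data storage and data submission", 5),
    ("General", "PeptideAtlas", "Supports raw data storage, data submission, and data analysis", 5),
    ("General", "Human Proteinpedia", "Supports raw data storage and data submission", 4),
    ("General", "iProX", "Supports raw data storage, data submission, and data analysis", 5),
    ("General", "Tranche", "Supports raw data storage and data submission", 3),
    ("General", "GPMDB", "Supports data analysis", 3),
    ("General", "MOPED", "Stores protein expression information from MS-based proteomics experiments", 3),
    ("General", "YPED", "An integrated bioinformatics suite and database for proteomics research", 4),
    ("Quantitative", "PaxDb", "Supports quantitative proteomics data storage", 3),
    ("PTMs-focused", "Phospho.ELM", "Supports phosphoproteomic MS data storage", 3),
    ("PTMs-focused", "PhosphoSitePlus", "Stores raw data and MS-reported PTM sites", 4),
    ("PTMs-focused", "dbPTM", "Stores raw data and MS/MS peptides associated with PTMs", 4),
    ("PTMs-focused", "PHOSIDA", "Supports raw data storage and phosphoproteomic MS data storage", 4),
]

# The three rating parameters from the table note, as search phrases.
# Matching on phrases is an illustrative assumption, not the authors' method.
CRITERIA = {
    "raw_data": "raw data",
    "submission": "data submission",
    "analysis": "data analysis",
}


def criteria_mentioned(features: str) -> set[str]:
    """Return the criteria whose phrase appears in a 'Main features' string."""
    text = features.lower()
    return {key for key, phrase in CRITERIA.items() if phrase in text}


def resources_mentioning(criterion: str) -> list[str]:
    """List resource names (in table order) whose features mention a criterion."""
    return [
        name
        for _category, name, features, _stars in TABLE_1
        if criterion in criteria_mentioned(features)
    ]


if __name__ == "__main__":
    for key in CRITERIA:
        print(key, "->", ", ".join(resources_mentioning(key)))
```

A query such as `resources_mentioning("analysis")` then returns the resources whose feature text explicitly mentions data analysis. Resources described in more general terms (for example MOPED and YPED) match no phrase, which is a limitation of this text-based approach.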


General proteomics data repositories

Proteomics IDEntifications database

The Proteomics IDEntifications (PRIDE) database, created by the European Bioinformatics Institute (EBI), is a web resource that collects MS-based proteomics data. By the end of 2014, PRIDE had accumulated data for 41,835 proteins, 269,806 unique peptides, and about 101 million spectra [2]. PRIDE is one of the most popular proteomic data repositories and has played an important role in the nascent Human Proteome Project (HPP) [3].

PeptideAtlas

PeptideAtlas is a database that stores various formats of output files and metadata from MS-based experiments [1]. It also allows users to submit raw data, which are periodically analyzed for identification and statistical analysis purposes. The results are made available to researchers through web-based presentation systems. PeptideAtlas can help plan targeted proteomics experiments, improve genome annotation, and support data mining projects [1].

Human Proteinpedia

Human Proteinpedia is a resource for integrating, storing, and sharing proteomic data [4]. It is a platform for collecting human proteomic data using a distributed annotation system, which allows the research community to contribute protein annotations. By the end of 2014, Human Proteinpedia had covered 15,231 proteins, 1,960,352 peptides, and about 5 million spectra [2]. It also provides a panorama of the human proteome.

iProX

iProX is an integrated proteome resources center based in China, built to support the worldwide sharing of proteomics data. Currently, iProX comprises an experiment data submission system and a proteome database. The iProX submission system is a public platform that was set up following the data-sharing policy of the ProteomeXchange consortium. Raw data and standardized metadata from proteomics experiments can be collected and shared by using controlled vocabularies to describe the Minimum Information About a Proteomics Experiment (MIAPE). Registered users can choose to submit their proteomics datasets to iProX via public or private modes. Datasets submitted via the public mode are openly accessible, whereas private datasets can only be accessed by authorized users. The iProX proteomics database, in turn, was developed as a structured storage platform for data deposited in the system. iProX facilitates data analysis and sharing. To date, it has covered 46 projects, 190 subprojects, and 6441 data files.

Tranche

Tranche is a data repository targeting storage and sharing of information for proteomics researchers. It supports re-use and dissemination of both data and software. To reduce data redundancy and achieve load balancing, it adopts peer-to-peer networking. It also uses a client–server model to ensure authentication and reliability. A client tool is required to upload and download datasets. It has several important features including pre-publication encryption, data pedigree, data integrity, immutability, and versioning. Tranche provides interfaces for PRIDE, Human Proteinpedia, and PeptideAtlas to store and disseminate large MS-based data files [5].

Global Proteome Machine Database

The Global Proteome Machine Database (GPMDB) is a resource for collecting diverse tandem mass spectra. It also includes peptide and protein identifications that are important for further MS computational research [6]. GPMDB provides a pipeline for reprocessing raw data submitted by users or imported from other repositories, generating XML files that store information about peptide and protein identification. Specifically, identified proteins are organized into separate spreadsheets for each chromosome and for mitochondrial DNA. By the end of 2014, GPMDB data spanned 136,373 proteins, 1,786,698 peptides, and 1020 million spectra [2]. GPMDB has played an important role in the Chromosome-Centric Human Proteome Project (C-HPP).

Model Organism Protein Expression Database

The Model Organism Protein Expression Database (MOPED) is a proteomics repository that integrates protein expression information from MS-based proteomics experiments on human specimens and on model organisms [7]. It also provides new estimates of protein abundance and concentration, and statistical summaries from experiments. Several search and visualization tools are available. By the end of 2014, MOPED had developed into a repository containing 17,141 proteins, 250,000 unique peptides, and approximately 15 million spectra [2], providing researchers with information on complex biological processes and thus supporting biomedical discovery.

Yale Protein Expression Database

The Yale Protein Expression Database (YPED) [8] is an integrated bioinformatics suite and database for proteomics research, which was significantly improved from the first version released in 2007 [9]. YPED supports many kinds of data including those from multiple MS instruments, different search engines, and labeled or label-free quantification. YPED is a web-accessible and user-friendly resource, designed to meet data management, archival, and analysis needs of high-throughput MS-based proteomics research.

Quantitative proteomics data repositories

PaxDb

PaxDb is a meta-database integrating whole-organism data and tissue-resolved data at absolute protein abundance levels for various model organisms. It imports quantitative proteomics datasets exclusively from published experiments and from primary proteomics data resources such as PRIDE and PeptideAtlas, and then analyzes the actual spectral count [10]. By the end of 2014, it included 10,482 proteins, 143,456 peptides, and about 24 million spectra [2]. The launch of PaxDb brings together disparate aspects of biology for high-throughput analysis and supports global comparative analysis across different organism groups.
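To make the idea of abundance estimation from spectral counts more concrete, the sketch below computes the normalized spectral abundance factor (NSAF), a widely used length-normalized measure. It is offered only as an example of spectral-count-based quantification, under the assumption that counts and protein lengths are available per protein; the text does not state that PaxDb uses NSAF, and the function name `nsaf` and the protein labels are hypothetical.

```python
def nsaf(spectral_counts: dict[str, int], lengths: dict[str, int]) -> dict[str, float]:
    """Normalized spectral abundance factor for each protein.

    NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j), where SpC is the number of
    spectra assigned to a protein and L is its length in amino acids.
    """
    # Spectral abundance factor: counts corrected for protein length,
    # since longer proteins yield more peptides and hence more spectra.
    saf = {protein: spectral_counts[protein] / lengths[protein]
           for protein in spectral_counts}
    total = sum(saf.values())
    if total == 0:
        raise ValueError("no spectra were assigned to any protein")
    # Normalize so the values sum to 1 within the sample.
    return {protein: value / total for protein, value in saf.items()}


if __name__ == "__main__":
    # Hypothetical proteins with made-up counts and lengths.
    counts = {"protA": 10, "protB": 30, "protC": 20}
    lengths = {"protA": 100, "protB": 300, "protC": 100}
    for protein, value in nsaf(counts, lengths).items():
        print(f"{protein}: {value:.3f}")
```

In this made-up example, protA and protB receive the same relative abundance even though protB has three times as many spectra, because protB is also three times as long.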

Proteomics data repositories focusing on protein PTMs

Phospho.ELM

Recent advances in MS techniques have enabled more efficient detection of phosphorylated proteins [9]. Phospho.ELM is a web-based resource aimed at storing phosphorylation data imported from research papers and phosphoproteomic MS analyses. The MS experiments are run on human and mouse cell lines and tissues. Phospho.ELM is used by laboratory scientists and computational biologists to develop public repositories [11]. To date, this web resource covers 42,914 instances, 299 kinases, 3657 references, 11,224 sequences, and 8698 substrates.

PhosphoSitePlus

PhosphoSitePlus (PSP) is a comprehensive, manually curated resource designed to collect information on the structure and function of PTMs, primarily of human and mouse origin. PSP supports two kinds of data: the modified amino acid and surrounding sequences, and upstream and downstream interactions with regard to functional regions of the protein [12]. The majority of PTM sites in PSP were detected using MS. PSP is useful to life scientists and biomedical researchers. Currently, PSP spans 50,636 proteins, 1,933,888 MS peptides, 438,576 high-throughput MS sites, 20,262 low-throughput sites, and 18,374 curated papers.

dbPTM

dbPTM is a resource that collects data on experimentally validated protein PTMs. This resource imports PTM sites from public resources such as SwissProt, Phospho.ELM, and O-GLYCBASE [13]. It also extracts identified peptides with PTMs from research papers. dbPTM is an important resource for researchers working on substrate specificity of PTM sites [14]. To date, dbPTM covers 153,113 phosphorylation experimental sites, 23,673 ubiquitylation experimental sites, 10,385 acetylation experimental sites, 15,678 N-linked glycosylation experimental sites, and 3711 O-linked glycosylation experimental sites.

Phosphorylation site database

The phosphorylation site database (PHOSIDA) is a collection of a large number of high-confidence phosphorylation sites. MS-based proteomics is used to identify these sites in various species [15]. To date, the database covers 80,062 N-glycosylated, phosphorylated, or acetylated sites. Stringent quality criteria based on a very low false positive rate are used to obtain these sites from high-resolution MS data [16]. PHOSIDA contains PTM sites from human as well as other species, including bacteria.

Concluding remarks

In this report, we have covered some important proteomics data repositories that are useful for the research community. These resources not only provide raw data and identification results, but also support prospective, high-throughput proteomics research. In addition, they act as data providers for large-scale genome annotation efforts. In the years to come, sharing data and metadata between repositories will become more important. Thus, proteomics repositories need to focus on developing an integrated approach to data accessibility between repositories. Meanwhile, with the advent of new instruments, new sample preparation techniques, and new data analysis methods, new forms of data will be continuously generated. The data currently shared in the repositories represent just a small fraction of the proteomics data that have actually been generated and will eventually become available. To attract more researchers to submit data, the resources will have to standardize the submission process and simplify the submission interface.

Competing interests

The authors have declared that there are no competing interests.

16 in total

1. Tranche distributed repository and ProteomeCommons.org.

Authors: Bryan E Smith; James A Hill; Mark A Gjukich; Philip C Andrews
Journal: Methods Mol Biol Date: 2011

2. Open source system for analyzing, validating, and storing protein identification data.

Authors: Robertson Craig; John P Cortens; Ronald C Beavis
Journal: J Proteome Res Date: 2004 Nov-Dec Impact factor: 4.466

3. YPED: a web-accessible database system for protein expression analysis.

Authors: Mark A Shifman; Yuli Li; Christopher M Colangelo; Kathryn L Stone; Terence L Wu; Kei-Hoi Cheung; Perry L Miller; Kenneth R Williams
Journal: J Proteome Res Date: 2007-09-15 Impact factor: 4.466

4. The PeptideAtlas Project.

Authors: Eric W Deutsch
Journal: Methods Mol Biol Date: 2010

5. PHOSIDA 2011: the posttranslational modification database.

Authors: Florian Gnad; Jeremy Gunawardena; Matthias Mann
Journal: Nucleic Acids Res Date: 2010-11-16 Impact factor: 16.971

6. Phospho.ELM: a database of phosphorylation sites--update 2011.

Authors: Holger Dinkel; Claudia Chica; Allegra Via; Cathryn M Gould; Lars J Jensen; Toby J Gibson; Francesca Diella
Journal: Nucleic Acids Res Date: 2010-11-09 Impact factor: 16.971

7. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse.

Authors: Peter V Hornbeck; Jon M Kornhauser; Sasha Tkachev; Bin Zhang; Elzbieta Skrzypek; Beth Murray; Vaughan Latham; Michael Sullivan
Journal: Nucleic Acids Res Date: 2011-12-01 Impact factor: 16.971

8. YPED: an integrated bioinformatics suite and database for mass spectrometry-based proteomics research.

Authors: Christopher M Colangelo; Mark Shifman; Kei-Hoi Cheung; Kathryn L Stone; Nicholas J Carriero; Erol E Gulcicek; TuKiet T Lam; Terence Wu; Robert D Bjornson; Can Bruce; Angus C Nairn; Jesse Rinehart; Perry L Miller; Kenneth R Williams
Journal: Genomics Proteomics Bioinformatics Date: 2015-02-21 Impact factor: 7.691

9. Human Proteinpedia: a unified discovery resource for proteomics research.

Authors: Kumaran Kandasamy; Shivakumar Keerthikumar; Renu Goel; Suresh Mathivanan; Nandini Patankar; Beema Shafreen; Santosh Renuse; Harsh Pawar; Y L Ramachandra; Pradip Kumar Acharya; Prathibha Ranganathan; Raghothama Chaerkady; T S Keshava Prasad; Akhilesh Pandey
Journal: Nucleic Acids Res Date: 2008-10-23 Impact factor: 16.971

10. PHOSIDA (phosphorylation site database): management, structural and evolutionary investigation, and prediction of phosphosites.

Authors: Florian Gnad; Shubin Ren; Juergen Cox; Jesper V Olsen; Boris Macek; Mario Oroshi; Matthias Mann
Journal: Genome Biol Date: 2007 Impact factor: 13.583
