jPOSTrepo: an international standard data repository for proteomes.

Shujiro Okuda¹, Yu Watanabe², Yuki Moriya³, Shin Kawano³, Tadashi Yamamoto⁴, Masaki Matsumoto⁵, Tomoyo Takami⁵, Daiki Kobayashi⁶, Norie Araki⁶, Akiyasu C Yoshizawa⁷, Tsuyoshi Tabata⁸, Naoyuki Sugiyama⁸, Susumu Goto⁷, Yasushi Ishihama⁹.

Abstract

Major advancements have recently been made in mass spectrometry-based proteomics, yielding an increasing number of datasets from various proteomics projects worldwide. In order to facilitate the sharing and reuse of promising datasets, it is important to construct appropriate, high-quality public data repositories. jPOSTrepo (https://repository.jpostdb.org/) has successfully implemented several unique features, including high-speed file uploading, flexible file management and easy-to-use interfaces. This repository has been launched as a public repository containing various proteomic datasets and is available for researchers worldwide. In addition, our repository has joined the ProteomeXchange consortium, which includes the most popular public repositories such as PRIDE in Europe for MS/MS datasets and PASSEL for SRM datasets in the USA. Later, MassIVE was introduced in the USA and accepted into ProteomeXchange, as was our repository in July 2016, providing important datasets from Asia/Oceania. Accordingly, this repository contributes to a global alliance to share and store all datasets from a wide variety of proteomics experiments. The repository is therefore expected to become a major repository, particularly for data collected in the Asia/Oceania region.

Substances: Proteome

Year: 2016 PMID: 27899654 PMCID: PMC5210561 DOI: 10.1093/nar/gkw1080

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Recent improvements in mass spectrometry (MS) have yielded large amounts of proteomics data (1). In order to ensure reliability of the data and the capacity for re-analysis in the future, it is necessary to construct a high-quality public data repository for promising datasets (2), similar to the public data repositories for DNA sequences (3,4) and gene expression profiles (5,6). However, such large-scale repositories can be difficult to maintain. Indeed, some public data repositories for proteomes, e.g. Peptidome (7) and Tranche (8), have been closed owing to issues with the sustainability of database maintenance. The databases PRoteomics IDEntifications (PRIDE) (9), mass spectrometry Interactive Virtual Environment (MassIVE, http://massive.ucsd.edu/) and PeptideAtlas SRM Experiment Libraries (PASSEL) (10) are currently available public data repositories. The ProteomeXchange (PX) consortium was proposed in 2008 and developed over several years, leading to a 2014 publication that reported substantial usage (1). PX provides coordinated submission of MS datasets for proteomics to these three proteomics data repositories. However, data transfer via the Internet from Asia/Oceania to the current submission points of ProteomeXchange, i.e. PRIDE in Europe and MassIVE in the USA for MS/MS datasets and PASSEL in the USA for SRM datasets, is usually very slow and troublesome. Moreover, the MS proteomic dataset files to be deposited are likely to be very large, e.g. >100GB, requiring upload times of tens of hours or more from Japan. Such slow transmission may be due to latency, i.e. delays before the actual data transfer. Generally, uploading huge files to a distant file server via the Internet incurs latency, which makes the net transfer speed very slow.
To avoid this problem, web services for accelerating transfer speed, such as the Aspera (http://asperasoft.com/) file transfer protocol, are often used; however, these services are expensive and can have disadvantages with regard to the sustainability of data repository maintenance. Additionally, these types of services require users to install specialized software on their computers, which places a burden on users. Here, we introduce a new public repository, jPOSTrepo (Japan ProteOme STandard Repository), which is an international standard data repository for proteomes. We successfully developed a high-speed file upload system and user interface with open-source libraries; all the submission operations can be completed within a web browser. In addition, the repository provides functions that facilitate the input of details of wet-lab experimental protocols. The number of deposits in the repository is expected to increase dramatically. Furthermore, the repository will contribute to improving the sustainability of the PX consortium by receiving some of the increasing data deposits worldwide, which are currently handled by other PX repositories located only in Europe and the USA (11). Our repository, like other PX repositories, will contribute to proteome data sharing among researchers worldwide.

DATABASE DESCRIPTION

jPOSTrepo is a public data repository for datasets obtained from proteomics experiments. Although users are required to sign up for the repository to upload and manage their datasets, once the depositor makes a dataset public, the data can be downloaded without signing up. Figure 1 shows the management of deposited data in our repository. Generally, researchers use different experimental methods for different experiments, although a single researcher usually uses a limited number of experimental methods. Thus, our repository manages information related to the experimental procedures as a ‘preset’ and information specific to each experiment as a ‘project’. To deposit datasets, users are first required to register their experimental details as presets. After the presets are registered, users create a project to deposit datasets and apply the configured presets to proteome data files.

Figure 1.

Diagram of a schematic model of the jPOSTrepo file management system.

Presets for experimental procedures

jPOSTrepo describes proteomics-related biological experimental procedures as four different types of presets: Sample, Fractionation, Enzyme/Modifications and MS mode. Registering experimental conditions as presets allows users to easily reuse them when they create other projects using the same experimental conditions. A researcher in one laboratory may use the same cell lines, the same experimental procedures, and the same mass spectrometers for multiple experiments; thus, registration of the experimental procedures as presets could save time when inputting metadata.

‘Sample’ preset

The ‘Sample’ preset includes biological sample information, such as species, tissue, cell type, and disease. For the input of such data, a ‘choosing from options’ style of user interface is preferable to an ‘inputting manually’ style; users save input effort, and the use of unified terms improves data retrieval. For example, the options for ‘species’ are derived from the NCBI organismal classification (NCBITaxon) (12) ontology. Similarly, ontology/controlled vocabulary-based options are available for almost all items in all four presets, e.g. for the experiment and analysis pipelines of MS, the Proteomics Standard Initiative (PSI)-Mass Spectrometry Controlled Vocabulary (CV) (13) is available, and for post-translational modifications (PTMs), the UNIMOD (14) CV is available. When no terms of interest are available in the default term sets, users can search for the term in the EMBL-EBI Ontology Lookup Service (15), which is embedded in the input form.
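The ‘choosing from options’ approach described above can be illustrated with a minimal Python sketch. The option table, the function names and the prefix-matching behaviour are assumptions made for illustration and are not taken from the jPOSTrepo implementation; only the NCBITaxon identifiers themselves are real.

```python
# Hypothetical option table mapping display labels to NCBITaxon identifiers.
# A real form would load its options from the ontology, not a hard-coded dict.
SPECIES_OPTIONS = {
    "Homo sapiens": "NCBITaxon:9606",
    "Mus musculus": "NCBITaxon:10090",
    "Saccharomyces cerevisiae": "NCBITaxon:4932",
}


def resolve_species(label, options=SPECIES_OPTIONS):
    """Return the ontology identifier for a label, or None if it is not an option."""
    return options.get(label.strip())


def find_options(prefix, options=SPECIES_OPTIONS):
    """List option labels that start with the typed prefix (case-insensitive)."""
    typed = prefix.strip().lower()
    return sorted(label for label in options if label.lower().startswith(typed))
```

Because free text such as "human" does not resolve to an identifier, a form built this way would have to fall back to an ontology search, which is the role the embedded Ontology Lookup Service plays in the repository.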

‘Fractionation’ preset

A general shotgun proteome analysis based on liquid chromatography tandem MS (LC-MS/MS) often requires prefractionation of samples at the protein and/or peptide levels. The ‘Fractionation’ preset is a generalized and simplified representation of various methodologies for the prefractionation process, such as the removal of highly expressed proteins by affinity chromatography for isolating low-abundance proteins and the enrichment of specific types of digested peptides, e.g. phosphorylated peptides using chemo-affinity chromatography with metal oxides or metal ions. The contents of this preset are classified into three separation levels: subcellular, protein and peptide. The subcellular level describes mainly subcellular locations of proteins, including whole cells. The protein level describes the prefractionation methods before protease digestion, such as sodium dodecyl sulfate polyacrylamide gel electrophoresis (SDS-PAGE) and immunoprecipitation. The peptide level describes the peptide prefractionation method, e.g. reversed phase LC at basic pH. The numbers of fractions and replicates are also described.

‘Enzyme/Modifications’ preset

The ‘Enzyme/Modifications’ preset accepts information on the parameters specified in the database search. A generic shotgun proteomics workflow contains a digestion step by proteases (or chemicals) to produce peptides. The ‘Enzyme/Modifications’ preset describes the names of the enzymes or chemicals employed. For the enzyme name, combinations of more than one enzyme are also allowed as input. In addition to the enzyme information, this preset covers the PTMs assumed in the search; the fixed and variable modifications specified in the database search are required as input. The species name can also be described; the species name specified in the database search should be input independently of the description in the ‘Sample’ preset.

‘MS mode’ preset

The ‘MS mode’ preset accepts information regarding the types of MS instruments (mass spectrometers) and the configurations used in the project. For isobaric labelling experiments, such as iTRAQ (16) and Tandem Mass Tag (17), multiple samples labelled with isobaric tags are measured at the same time. Information on such multiplex experiments is also described in this MS mode preset.

Profile consisting of four types of presets

Raw proteomics data files (MS measurements) are linked to a ‘profile,’ i.e. the combination of the four presets. Our repository is designed to reduce the time and effort required for submission; thus, users can apply a profile to multiple files at one time using a simple drag-and-drop approach in the web browser. The repository accepts both raw mass spectrum files and peak list files generated from spectra, i.e. mzML (18) and Mascot Generic Format (MGF; http://www.matrixscience.com/help/data_file_help.html), as well as database-search result files, i.e. mzIdentML (19) and mzTab (20). It also accepts information on the search software used, such as Mascot (21), X!Tandem (22) and MaxQuant (23), and on the settings used in database searches.
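The relationship between presets, a profile and data files can be sketched as a small data model. This is only an illustration under stated assumptions: all class names, field names and the `apply_profile` helper are hypothetical and do not reflect the repository's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplePreset:
    species: str
    tissue: str = ""
    cell_type: str = ""
    disease: str = ""


@dataclass(frozen=True)
class FractionationPreset:
    # The three separation levels described in the text.
    subcellular: str = "whole cell"
    protein_level: str = ""
    peptide_level: str = ""
    fractions: int = 1
    replicates: int = 1


@dataclass(frozen=True)
class EnzymeModificationsPreset:
    enzymes: tuple  # more than one enzyme is allowed
    fixed_modifications: tuple = ()
    variable_modifications: tuple = ()
    search_species: str = ""  # entered independently of SamplePreset.species


@dataclass(frozen=True)
class MSModePreset:
    instrument: str
    isobaric_labelling: str = ""  # e.g. "iTRAQ"; empty if not multiplexed


@dataclass(frozen=True)
class Profile:
    """A profile is the combination of the four presets."""
    sample: SamplePreset
    fractionation: FractionationPreset
    enzyme_modifications: EnzymeModificationsPreset
    ms_mode: MSModePreset


def apply_profile(profile, file_names, assignments=None):
    """Link one profile to many files at once; returns a new file -> profile map."""
    updated = dict(assignments or {})
    for name in file_names:
        updated[name] = profile
    return updated
```

Making the presets immutable (`frozen=True`) reflects the idea that a registered preset is reused unchanged across projects; a different experimental condition would be a new preset rather than an edit to a shared one.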

File upload

After linkage of the metadata profile (presets) and the deposited data files, users can upload the files to the repository. During this upload process, the files are split into smaller fragments, called ‘chunks,’ which are subsequently uploaded to the repository in parallel. The latency, i.e. the delay before the actual data transfer, is known to increase during data transfer via the Internet as the length of the data communication path increases. Thus, the data transfer speed between physically distant locations is often extremely slow. The parallel upload of small chunks can mitigate this latency problem. As shown in Figure 2A, when datasets are uploaded to our repository, a positive correlation is observed between the file size and the transfer duration; the average transfer speed is quite fast, ∼9MB/s. In addition, the file transfer speed is, in most cases, independent of the distance from which the user deposits the data (Figure 2B). The most distant site was more than 5000 km from the server location, and the file transfer speed was still sufficient (∼5MB/s). Thus, the effect of latency appears to be limited during file upload to the repository, even from more distant locations.
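The chunked, parallel upload described above can be sketched in Python as follows. This is not the flow.js code used by the repository; it is a minimal illustration in which the in-memory `InMemoryServer`, the function names, the chunk size and the number of workers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


def split_into_chunks(data: bytes, chunk_size: int):
    """Split data into (index, payload) pairs of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (offset // chunk_size, data[offset:offset + chunk_size])
        for offset in range(0, len(data), chunk_size)
    ]


class InMemoryServer:
    """Hypothetical stand-in for the repository: stores chunks by index."""

    def __init__(self):
        self._parts = {}
        self._lock = Lock()

    def receive(self, index: int, payload: bytes) -> None:
        with self._lock:  # chunks may arrive concurrently and out of order
            self._parts[index] = payload

    def assemble(self) -> bytes:
        """Rebuild the file by concatenating chunks in index order."""
        return b"".join(self._parts[i] for i in sorted(self._parts))


def upload_parallel(data: bytes, chunk_size: int, send_chunk, workers: int = 4) -> int:
    """Send all chunks through a thread pool; return the number of chunks sent."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises any exception from a worker.
        list(pool.map(lambda chunk: send_chunk(*chunk), chunks))
    return len(chunks)


server = InMemoryServer()
payload = bytes(range(256)) * 10  # 2560 bytes of sample data
sent = upload_parallel(payload, chunk_size=100, send_chunk=server.receive)
```

Because each chunk carries its own index, the receiving side can reassemble the file regardless of arrival order, and several requests can be in flight at once so that the per-request delay is overlapped rather than paid serially.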

Figure 2.

File transfer speed to jPOSTrepo. (A) Linear relationships between the file size and duration for uploading to the repository. (B) File transfer speed to the repository is independent of the distance between users and the repository server. Even at the most distant location (>5000 km), outside of Japan, where the repository server is established, a high transfer speed of ∼5MB/s was achieved. The inner panel shows magnification of the same figure excluding data for the most distant location. Bars represent standard deviations at each distance.
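To put the reported speeds in context, the following sketch converts a file size and a sustained transfer speed into an upload duration. It assumes decimal units (1 GB = 1000 MB) and a constant speed over the whole transfer; both are simplifications, and the function name is hypothetical.

```python
def upload_hours(size_gb: float, speed_mb_per_s: float) -> float:
    """Estimate upload time in hours, assuming 1 GB = 1000 MB and constant speed."""
    if speed_mb_per_s <= 0:
        raise ValueError("speed must be positive")
    seconds = size_gb * 1000 / speed_mb_per_s
    return seconds / 3600


# A 100 GB dataset at the two speeds quoted in the text.
hours_at_average_speed = upload_hours(100, 9)   # about 3.1 hours
hours_at_distant_speed = upload_hours(100, 5)   # about 5.6 hours
```

Under these assumptions, a 100GB deposit would take roughly 3 h at ∼9MB/s and roughly 5.6 h at ∼5MB/s, compared with the tens of hours mentioned in the Introduction for uploads from Japan to distant repositories.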

Partial and complete submission

jPOSTrepo is currently a member of the PX consortium. The PX consortium provides standard quality criteria for datasets deposited by users. The complete submission recommended by the PX consortium requires metadata for biological samples, the experimental procedures used in the study, and the correspondence between each peptide in the search result files and each peak in the peak list files. The repository performs the validation process for the complete submission. For the complete submission of datasets, a table view of search results is provided, and the correspondence between the mass peaks and the peptides can be visualized. As with the other repositories, our repository can also make datasets public as a partial submission. The partial submission requires raw data files, search result files, and valid metadata under the standard criteria of the PX consortium for complete submission.
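The correspondence check required for a complete submission can be illustrated with a simplified sketch. The record layout (a dictionary with `peptide` and `spectrum_id` keys) and the function names are assumptions; the actual PX validation operates on formats such as mzIdentML and involves more checks than shown here.

```python
def find_unmatched_identifications(identifications, peak_list_ids):
    """Return identifications whose spectrum is absent from the peak list files."""
    known = set(peak_list_ids)
    return [ident for ident in identifications if ident["spectrum_id"] not in known]


def passes_correspondence_check(identifications, peak_list_ids):
    """True if there is at least one identification and every one maps to a spectrum."""
    if not identifications:
        return False
    return not find_unmatched_identifications(identifications, peak_list_ids)


# Hypothetical search results: two map to known spectra, one does not.
results = [
    {"peptide": "PEPTIDEK", "spectrum_id": "scan=101"},
    {"peptide": "SAMPLER", "spectrum_id": "scan=102"},
    {"peptide": "MISSINGK", "spectrum_id": "scan=999"},
]
spectra = ["scan=101", "scan=102", "scan=103"]
unmatched = find_unmatched_identifications(results, spectra)
```

Reporting the unmatched records, rather than only a pass/fail flag, lets a depositor see which search results lack a corresponding spectrum.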

Data publication and download

When the validation process is finished and the deposited datasets are determined to be valid in the context of the PX consortium criteria, users can finish and lock the submission process, yielding both jPOST and PX identifiers. At this stage of the submission of a ‘project’, the dataset deposited to the repository is under an ‘embargo’ as private, and is automatically published as public on the ‘announcement date’ set by the users themselves. During this embargo period, users can issue a specialized URL and password that give access to the project to anyone who knows them, such as journal editors and reviewers. In addition, users can revise the temporarily locked project according to the reviewers' comments. Revision numbers are provided for the revised data. The datasets of publicly available projects can be downloaded without any limitations. In addition, public projects can be searched by registered keywords, i.e. ontology terms, CVs and principal investigator names registered in presets and projects.
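The embargo rule described above can be expressed as a short sketch. The `Project` class, its fields and the password comparison are hypothetical; the source does not describe how the repository stores or checks reviewer credentials.

```python
import hmac
from dataclasses import dataclass
from datetime import date


@dataclass
class Project:
    announcement_date: date
    review_password: str  # hypothetical field; real systems should store a hash


def is_public(project: Project, today: date) -> bool:
    """A project becomes public automatically on its announcement date."""
    return today >= project.announcement_date


def can_access(project: Project, today: date, password=None) -> bool:
    """Public projects are open to all; embargoed ones need the reviewer password."""
    if is_public(project, today):
        return True
    if password is None:
        return False
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(
        password.encode("utf-8"), project.review_password.encode("utf-8")
    )
```

For example, a project with an announcement date of 1 March would be visible to a reviewer holding the password in February and to everyone from 1 March onwards.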

System implementation

The file upload system in our repository employs an open-source JavaScript library, flow.js (https://github.com/flowjs/). The visualization system of the corresponding relationships of mass peaks and peptides for the complete submission employs an open-source library, Lorikeet (http://uwpr.github.io/Lorikeet/), which is a plugin of the jQuery library (https://jquery.com/) to view MS/MS spectra annotated with fragment ions.

DISCUSSION

We have developed and launched jPOSTrepo to share and store a variety of proteome datasets for researchers worldwide. This repository is now operating in the Asia/Oceania area, where no international proteome repository had been established, and accepts raw and processed MS data for proteomics from all over the world as a member of the PX consortium. Users can obtain a globally common accession number for their deposited data through either the ‘complete’ submission or the ‘partial’ submission. The repository also stores detailed metadata, such as samples, instruments, analysis software, and settings, as four types of presets with some ontologies and CVs. It also provides an ultra-fast file uploader and a user-friendly web interface. All metadata submitted to our repository, especially all experimental procedures described in the current four types of presets, are expressed with ontologies/vocabularies and are therefore ‘computer-readable.’ When this information is expressed further in a unified framework, such as the Resource Description Framework (RDF) data model, the proteome data could also be linked to a wide variety of other data, such as genomes and transcriptomes, under the ‘linked-open data’ concept. Therefore, our repository could be used to develop a system to enrich the RDF expression of proteomic datasets and to join/integrate them with other data resources.
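As an illustration of how ontology-based metadata could be expressed in RDF, the following sketch serializes two statements about a project as N-Triples lines. The `http://example.org/jpost/` namespace, the predicate names and the project identifier are placeholders invented for this example and do not represent an actual jPOST RDF schema; only the OBO PURL pattern for NCBITaxon terms is real.

```python
EX = "http://example.org/jpost/"         # hypothetical namespace
OBO = "http://purl.obolibrary.org/obo/"  # OBO PURL base used by NCBITaxon


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets as N-Triples requires."""
    return "<" + value + ">"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def project_triples(project_id: str, species_taxon_id: int, instrument_label: str):
    """Return N-Triples lines linking a project to its species and instrument."""
    subject = iri(EX + "project/" + project_id)
    species = iri(OBO + "NCBITaxon_" + str(species_taxon_id))
    return [
        " ".join([subject, iri(EX + "species"), species, "."]),
        " ".join([subject, iri(EX + "instrument"), literal(instrument_label), "."]),
    ]


triples = project_triples("P0001", 9606, "hypothetical instrument")
```

Because the species is given as an ontology IRI rather than free text, the same node could in principle be shared with genome or transcriptome resources that use NCBITaxon, which is the kind of linkage the ‘linked-open data’ concept relies on.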

23 in total

1. A method for reducing the time required to match protein sequences with tandem mass spectra.

Authors: Robertson Craig; Ronald C Beavis
Journal: Rapid Commun Mass Spectrom Date: 2003 Impact factor: 2.419

2. Unimod: Protein modifications for mass spectrometry.

Authors: David M Creasy; John S Cottrell
Journal: Proteomics Date: 2004-06 Impact factor: 3.984

3. The need for a public proteomics repository.

Authors: John T Prince; Mark W Carlson; Rong Wang; Peng Lu; Edward M Marcotte
Journal: Nat Biotechnol Date: 2004-04 Impact factor: 54.908

4. The Ontology Lookup Service: bigger and better.

Authors: Richard Côté; Florian Reisinger; Lennart Martens; Harald Barsnes; Juan Antonio Vizcaino; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2010-05-11 Impact factor: 16.971

5. The HUPO proteomics standards initiative- mass spectrometry controlled vocabulary.

Authors: Gerhard Mayer; Luisa Montecchi-Palazzi; David Ovelleiro; Andrew R Jones; Pierre-Alain Binz; Eric W Deutsch; Matthew Chambers; Marius Kallhardt; Fredrik Levander; James Shofstahl; Sandra Orchard; Juan Antonio Vizcaíno; Henning Hermjakob; Christian Stephan; Helmut E Meyer; Martin Eisenacher
Journal: Database (Oxford) Date: 2013-03-12 Impact factor: 3.451

6. mzML--a community standard for mass spectrometry data.

Authors: Lennart Martens; Matthew Chambers; Marc Sturm; Darren Kessner; Fredrik Levander; Jim Shofstahl; Wilfred H Tang; Andreas Römpp; Steffen Neumann; Angel D Pizarro; Luisa Montecchi-Palazzi; Natalie Tasman; Mike Coleman; Florian Reisinger; Puneet Souda; Henning Hermjakob; Pierre-Alain Binz; Eric W Deutsch
Journal: Mol Cell Proteomics Date: 2010-08-17 Impact factor: 5.911

7. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

8. ArrayExpress update--simplifying data submissions.

Authors: Nikolay Kolesnikov; Emma Hastings; Maria Keays; Olga Melnichuk; Y Amy Tang; Eleanor Williams; Miroslaw Dylag; Natalja Kurbatova; Marco Brandizi; Tony Burdett; Karyn Megy; Ekaterina Pilicheva; Gabriella Rustici; Andrew Tikhonov; Helen Parkinson; Robert Petryszak; Ugis Sarkans; Alvis Brazma
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

9. 2016 update of the PRIDE database and its related tools.

Authors: Juan Antonio Vizcaíno; Attila Csordas; Noemi del-Toro; José A Dianes; Johannes Griss; Ilias Lavidas; Gerhard Mayer; Yasset Perez-Riverol; Florian Reisinger; Tobias Ternent; Qing-Wei Xu; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2015-11-02 Impact factor: 16.971

10. GenBank.

Authors: Karen Clark; Ilene Karsch-Mizrachi; David J Lipman; James Ostell; Eric W Sayers
Journal: Nucleic Acids Res Date: 2015-11-20 Impact factor: 16.971
