
iProX in 2021: connecting proteomics data sharing with big data.

Tao Chen¹, Jie Ma¹, Yi Liu¹, Zhiguang Chen², Nong Xiao², Yutong Lu², Yinjin Fu², Chunyuan Yang¹, Mansheng Li¹, Songfeng Wu¹, Xue Wang¹, Dongsheng Li¹, Fuchu He¹, Henning Hermjakob^1,3, Yunping Zhu^1,4.

Abstract

The rapid development of proteomics studies has resulted in large volumes of experimental data. The emergence of big data platforms provides the opportunity to handle these large amounts of data. The integrated proteome resource, iProX (https://www.iprox.cn), which was initiated in 2017, has been greatly improved with an up-to-date big data platform implemented in 2021. Here, we describe the main iProX developments since its first publication in Nucleic Acids Research in 2019. First, a hyper-converged architecture with high scalability supports the submission process. A Hadoop cluster can store large amounts of proteomics datasets, and a distributed, RESTful-styled Elastic Search engine can query millions of records within one second. Also, several new features, including the Universal Spectrum Identifier (USI) mechanism proposed by ProteomeXchange, a RESTful Web Service API, and a high-efficiency reanalysis pipeline, have been added to iProX for better open data sharing. By the end of August 2021, 1526 datasets had been submitted to iProX, reaching a total data volume of 92.42TB. With the implementation of the big data platform, iProX can support PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities that meet the requirements of the fast-growing field of proteomics.

Entities: Chemical

Substances:
Proteome

Year: 2022 PMID: 34871441 PMCID: PMC8728291 DOI: 10.1093/nar/gkab1081

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

With the advance of high-throughput technologies, large-scale biological data has accumulated at an unprecedented rate. First, methods were established for the genome and transcriptome, followed by the proteome and metabolome, which may be collectively designated ‘multi-omics’ (1,2). Today, nearly all fields in life sciences can be connected by big data. As in other areas of big data science, the major challenges in the proteomics field include data storage, management, analyses, and open sharing. The ProteomeXchange (PX, http://www.proteomexchange.org/) consortium (3,4) coordinates a stable, distributed infrastructure for effective proteomics data sharing through interaction with PX members, including PRIDE (http://www.ebi.ac.uk/pride/archive/, EMBL-EBI, Cambridge, UK) (5), PeptideAtlas with the PASSEL resource (http://www.peptideatlas.org/passel/, ISB, Seattle, WA, USA) (6), MassIVE (https://massive.ucsd.edu/, UCSD, San Diego, CA, USA), jPOST (https://jpostdb.org/, various institutions, Japan) (7), iProX (National Center for Protein Sciences, Beijing, China) (8), and Panorama Public (https://panoramaweb.org/, University of Washington, Seattle, WA, USA) (9). Driven by improvements in speed and resolution of mass spectrometry (MS), the scale and complexity of proteomics datasets are expanding, which has resulted in a rapid accumulation of data in almost all PX repositories. To move forward with sharing large proteomics datasets, both hardware and software must continue to improve; therefore, incorporating big data technology, framework and scalable cloud-based solutions will help to manage huge proteomics datasets. iProX was launched in April 2017 and joined the PX consortium in November 2017. As a full member of the consortium, iProX (https://www.iprox.cn/) has been updated significantly by implementing an up-to-date big data platform in 2021. It can support storage and rapid access to large amounts of proteomic data. 
Here, we describe the main developments in iProX since its first publication in Nucleic Acids Research in 2019. First, we summarize the overall data submission and data statistics to demonstrate the wide adoption of iProX. Then, we highlight the big data architecture and infrastructure of iProX, which can support PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities to meet the requirements of rapidly accumulating proteomics data. Finally, we introduce the implementation of the Universal Spectrum Identifier (USI), RESTful Web Service API, and the reanalysis and visualization pipeline of public data in iProX.

CURRENT STATUS AND UPDATES OF REPOSITORY

By the end of August 2021, 1526 datasets had been submitted to iProX, consisting of 984 publicly released datasets (64%) and 542 to-be-released datasets (36%), for a total of 92.42TB of accumulated data. As shown in Figure 1A-C, the number and size of proteomics datasets grew rapidly from 2017 to 2021. This growth has been facilitated by continuous improvements in mass spectrometry instruments and sample handling workflows (10). Significantly large (>500GB) and super large (>1TB) proteome studies have been completed in recent years (Figure 1B and Supplementary Figure S1). Nineteen datasets larger than 1 TB have been submitted to iProX (Supplementary Figure S1), and 18 (94.7%) of them were submitted within the past 3 years (2019–2021). As shown in Figure 1D, the most frequent species in iProX include Homo sapiens, Mus musculus, Rattus norvegicus, Escherichia coli, Saccharomyces cerevisiae, Triticum aestivum, Oryza sativa and Zea mays. Moreover, datasets from species with few proteome studies have also been collected in iProX (Supplementary Figure S2). For example, the iProX datasets from studies on Haemaphysalis longicornis (10 datasets), Anabaena sp. PCC 7120 (Nostoc sp. PCC 7120, 3 datasets), Bambusa pervariabilis (2 datasets), and Yersinia pestis CO92 (7 datasets) represent almost all the proteome datasets from these species in ProteomeXchange.
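The proportions quoted above follow directly from the stated counts. A minimal Python check, using only the figures given in this section (the variable and function names are illustrative, not taken from iProX):

```python
# Counts quoted in the text (as of the end of August 2021).
total_datasets = 1526
public_datasets = 984
pending_datasets = 542
over_1tb_total = 19
over_1tb_recent = 18  # submitted in 2019-2021


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# The two release categories account for every submitted dataset.
assert public_datasets + pending_datasets == total_datasets

print(f"publicly released: {share(public_datasets, total_datasets):.0f}%")
print(f"to be released:    {share(pending_datasets, total_datasets):.0f}%")
print(f">1 TB, 2019-2021:  {share(over_1tb_recent, over_1tb_total):.1f}%")
```

Rounded to the precision used in the text, these reproduce the 64%, 36% and 94.7% figures.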

Figure 1.

Summary of the datasets publicly released in iProX (as of the end of August 2021). (A) Cumulative data size and number of submitted datasets per month (ranging from November 2017 to August 2021). (B) Top 10 released datasets with the largest size. (C) Cumulative numbers of submitted datasets per year. (D) Distribution of the species of datasets publicly available in iProX. Some datasets in iProX were generated from samples of multiple species; thus, the sum of the numbers for the different species is slightly higher than the number of all public datasets.

BIG DATA ARCHITECTURE AND INFRASTRUCTURE OF IPROX

First, a hyper-converged architecture with high scalability has been constructed to support the submission process. Next, a Hadoop cluster (11) is used to store the large amounts of proteomics data. The storage capacity of iProX was expanded from 360TB to 1PB in 2021. Also, a distributed, RESTful-styled Elastic Search engine (12) is employed to retrieve millions of records within one second. As shown in Figure 2, for the submission process, the hyper-converged architecture integrates the server, storage, and network resources into a virtual pool, so that it can achieve high scalability using a distributed scale-out expansion for storage and computing resources. Reanalyzed data, such as proteins, peptides and spectra, are stored in HBase (13), a distributed column-oriented database, in the Hadoop cluster. The indexes of the identifications are stored in an Elastic Search cluster for distributed, highly scalable, real-time search and data analysis. A universal search interface, available both on the web and through APIs, is provided for obtaining the metadata and reanalysis data of the originally submitted datasets.
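The division of labour described above, with bulk records kept in a column-oriented store and a separate search index pointing into it, can be illustrated with a toy in-memory sketch. Everything in it is hypothetical: the class names, the row-key layout and the field names are assumptions made for illustration, and plain Python dictionaries stand in for HBase and the Elastic Search cluster. It is not the iProX implementation.

```python
from collections import defaultdict


class ColumnStore:
    """Toy stand-in for a column-oriented store such as HBase (assumption)."""

    def __init__(self):
        # row key -> column family -> column qualifier -> value
        self._rows = {}

    def put(self, row_key, family, column, value):
        self._rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key):
        return self._rows.get(row_key, {})


class SearchIndex:
    """Toy stand-in for a search index: (field, value) -> set of row keys."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, row_key, field, value):
        self._postings[(field, value)].add(row_key)

    def search(self, field, value):
        # .get() avoids creating empty entries for unknown terms.
        return sorted(self._postings.get((field, value), set()))


def store_identification(store, index, row_key, peptide, protein):
    """Write the full record to the store and only the searchable fields to the index."""
    store.put(row_key, "id", "peptide", peptide)
    store.put(row_key, "id", "protein", protein)
    index.add(row_key, "peptide", peptide)
    index.add(row_key, "protein", protein)


def find_by_peptide(store, index, peptide):
    """Query the index first, then fetch each matching row from the store."""
    return [store.get(key) for key in index.search("peptide", peptide)]
```

The point of the split is that a query touches only the compact index to find row keys, and the large store is read only for the rows that match.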

Figure 2.

Hadoop-based big data architecture and infrastructure of iProX.

To achieve independent, high-speed data file transfer, both the web-based and the fast Aspera-based (https://asperasoft.com/) upload and download steps were rebuilt as independent transfer sub-services behind a RESTful API interface, following a service-oriented architecture. To provide continuous and reliable high-speed transfer services, we also upgraded Aspera to the latest release and integrated it into the iProX system. Searches of metadata and of identified proteins, peptides and spectra are likewise handled by sub-services, which achieves second-level response times without interrupting the regular submissions from individual submitters and large-scale projects. Thus, we have improved the performance of iProX in terms of high availability, high reliability and real-time response without changing the original submission process. Notably, a disaster recovery system and full real-time backup site of iProX was designed and deployed at the National Supercomputing Center in Guangzhou (Guangdong, southern China), which can take over the service within several minutes when the main site in Beijing is unavailable. Thus, iProX can provide continuous, uninterrupted service.

NEW FEATURES IN IPROX 2021

Building on the Hadoop-based big data platform for iProX, we developed several new features, including the implementation of the Universal Spectrum Identifier (USI), the reanalysis and visualization pipeline for public data in iProX, and RESTful Web Service APIs, as shown in Figure 3.

Figure 3.

New features implemented into iProX 2021.

UNIVERSAL SPECTRUM IDENTIFIER (USI)

To allow specific spectra of high importance to be referenced and cited in published manuscripts, iProX implements the standardized Universal Spectrum Identifier (USI) proposed by PX (14,15). The USI is a multi-part key separated by colon characters; for example, mzspec:PXD006512:CNHPP_HCC_LC_profiling_L006_P_F1:scan:64442:VADALTNAVAHVDDMPNALSALSDLHAHK/3. This USI consists of five basic components and indicates that the peptide VADALTNAVAHVDDMPNALSALSDLHAHK with a charge of 3 was identified in the raw file CNHPP_HCC_LC_profiling_L006_P_F1.raw of dataset PXD006512, at scan number 64442. Interpreted this way, the USI resolves to a specific spectrum contained in a dataset deposited in iProX. In iProX, a USI is resolved to the spectrum stored in the HBase cluster by means of the Elastic Search engine. iProX enables USI lookup and display at http://www.iprox.cn/page/spectrum.html for 20 million spectra in HBase. Users can also access this page by clicking the ‘Resources’ menu on the iProX main page and selecting the sub-menu ‘Universal Spectrum Identifier (USI) Search’. Users paste a USI into a text box on the page and press the ‘view’ button to look up the spectrum. The lookup result returned from the Elastic Search engine is visualized with an embedded Lorikeet spectrum viewer in the page. Alternatively, the USI lookup service can be triggered automatically by using the URL https://www.iprox.cn/page/spectrum.html?USI=. For example, the URL https://www.iprox.cn/page/spectrum.html?USI=mzspec:PXD006512:CNHPP_HCC_LC_profiling_L006_P_F1:scan:64442:VADALTNAVAHVDDMPNALSALSDLHAHK/3 can be used to look up the spectrum identified by the USI in the URL and visualize the spectrum, if available. Along with the spectra identified by the USIs, the identified proteins and peptides are obtained by our reanalysis pipeline and stored in the HBase cluster.
The result files of ‘complete submission’ projects in iProX are also parsed into spectrum records, each uniquely identified by a USI.
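The colon-separated structure of the example USI and the lookup URL pattern given above can be sketched in a few lines of Python. This is a minimal illustration that assumes the layout of the example in the text (a fixed mzspec prefix, then dataset, run name, index type, index value and an optional peptide/charge interpretation). The function and field names are hypothetical, and the sketch does not attempt to cover the full USI specification.

```python
from typing import NamedTuple, Optional

# URL prefix quoted in the text for automatic USI lookup.
LOOKUP_PREFIX = "https://www.iprox.cn/page/spectrum.html?USI="


class ParsedUSI(NamedTuple):
    collection: str          # dataset accession, e.g. PXD006512
    ms_run: str              # raw file name without extension
    index_type: str          # e.g. "scan"
    index: str               # e.g. "64442"
    peptide: Optional[str]   # interpretation, if present
    charge: Optional[int]    # charge state, if present


def parse_usi(usi: str) -> ParsedUSI:
    """Split a USI shaped like the example in the text into named parts."""
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI of the form shown in the text: {usi!r}")
    _, collection, ms_run, index_type, index = parts[:5]
    peptide: Optional[str] = None
    charge: Optional[int] = None
    if len(parts) == 6:
        head, sep, tail = parts[5].rpartition("/")
        if sep:
            peptide, charge = head, int(tail)
        else:
            # Interpretation given without a "/charge" suffix.
            peptide = parts[5]
    return ParsedUSI(collection, ms_run, index_type, index, peptide, charge)


def lookup_url(usi: str) -> str:
    """Build the lookup URL by appending the USI to the documented prefix."""
    return LOOKUP_PREFIX + usi
```

Applied to the example USI, `parse_usi` separates the dataset accession PXD006512, the run name, the scan number 64442 and the peptide with charge 3, and `lookup_url` reproduces the example URL from the text.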

REANALYSIS AND VISUALIZATION PIPELINE OF PUBLIC DATA

Due to the unprecedented availability of proteomics data in the public domain, the need for data reuse is increasing (16,17). A high-efficiency reanalysis pipeline was built and applied to the publicly released datasets in iProX in order to reuse them and extract identification evidence. This process generated millions of high-quality spectra and identifications. The identified proteins are provided with UniProt accession numbers and associated URLs. At present, this reanalysis pipeline can handle datasets generated by the DDA (Data Dependent Acquisition) workflow. We applied the reanalysis pipeline to the public dataset IPX0000937000 (18) and derived 20 million new identifications at controlled false discovery rates. All of these identifications were parsed and stored in an HBase cluster. These reanalysis data are readily accessible through a new search interface based on the Elastic Search engine and can be traced back to the original datasets by IPX accession numbers. We will reanalyze all the public data, build a large-scale spectral library, and cross-reference MS identifications with other external datasets, such as UniProt. Clicking the ‘Resources’ menu on the iProX main page and selecting the sub-menu ‘Reanalyzed Datasets’ reveals the list of the reanalyzed datasets with the number of identifications. Clicking the ‘view’ button next to a dataset opens the visualization of its identifications. The menu tabs ‘Protein’, ‘Peptide’, and ‘Spectrum’ are linked to the details of the identifications for the chosen dataset. IPX accession numbers on the page link back to the metadata of the chosen dataset. Specifically, the link on a protein ID leads to all the identified peptides corresponding to that protein. Clicking ‘Peptide’ then links to the spectra related to the peptide. The link on the USI leads to the visualization page for the spectrum.
The ‘Summary charts’ menu links to the page with the statistics of these identifications.

IPROX RESTFUL WEB SERVICE API

Besides supporting interactive access to the data, iProX provides a RESTful Web Service Application Programming Interface (API), proposed by PX, for automated access to proteomics results. The API exposes dataset metadata as well as the peptide, protein, and spectrum data from the reanalysis: it can return the metadata of a specific dataset or of a list of datasets, and it can retrieve peptidoforms, proteins, and peptide spectrum matches (PSMs), or a list of spectra referenced by USIs. These APIs are documented at https://www.iprox.cn/proxi/swagger-ui.html. Users can also access this page by clicking the ‘Resources’ menu on the main page and selecting the sub-menu ‘Web Service APIs’. For example, ‘https://www.iprox.cn/proxi/datasets/PXD008840’ returns the details of the dataset PXD008840 in JSON format, which can be used for further programmatic processing.
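As a hedged illustration of programmatic access, the dataset URL quoted above could be requested from Python with the standard library alone. The base URL and the /datasets/ path are taken from the example in the text. The function names are hypothetical, and since the structure of the returned JSON is not described here, the sketch only decodes it without assuming any field names.

```python
import json
from urllib.request import urlopen

# Base of the example URL given in the text.
PROXI_BASE = "https://www.iprox.cn/proxi"


def dataset_url(accession: str) -> str:
    """Build the dataset-details URL following the pattern of the example."""
    return f"{PROXI_BASE}/datasets/{accession}"


def fetch_dataset(accession: str, timeout: float = 30.0):
    """Request the dataset details and decode the JSON body.

    Requires network access; raises urllib.error.URLError on failure.
    """
    with urlopen(dataset_url(accession), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_dataset("PXD008840")` would issue the request shown in the text; the exact fields of the response are best read from the Swagger page mentioned above.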

DISCUSSION AND FUTURE PLANS

The generation of massive amounts of proteome data is leading us towards the ‘big data’ era in proteomics. It is now easier than ever to produce large amounts of proteomics data. In a few days, a protein scientist can create a terabyte of data. In recent years, many datasets with continuously increasing data sizes have been submitted and deposited to iProX. There is an urgent need to connect proteomics big data with simple transfer, open sharing and efficient reuse in the scientific field. iProX can manage large and complex proteomics data by updating the software and hardware architecture using a big data platform, supporting PB-level data storage, hundreds of billions of spectra records, and second-level latency service capabilities. Also, a high-efficiency reanalysis pipeline was built in iProX and used to analyze publicly available datasets and generate millions of high-quality spectra and identifications. Currently, we have primarily analyzed several large-scale datasets [e.g. human proteome datasets generated from the Chinese Human Proteome Project (CNHPP)]. We will then transition to non-human organisms, particularly focusing on species with limited proteome information, for which most MS-based proteomics data are collected in iProX. These smaller datasets will provide useful resources for their specific fields of study. At present, all PX resources are committed to completely open data, which means there are currently no limitations for data reuse by the community (19). One increasingly relevant topic is whether proteomics data from human samples can be used to identify individuals and whether proteomics data may be considered personally identifiable information (19,20). Privacy risks of proteomics data may emerge, and this issue will need to be properly assessed and managed.
Currently, all PX resources have decided to phase in licensing for dataset sharing and reuse, starting with a Creative Commons CC0 license as the minimum level and likely moving to CC-BY in the future (3). iProX has had its own data license terms since its launch in 2017, and we are moving towards the CC-BY license proposed by PX. Also, the management rule for human genetic resources in China was implemented in July 2019 and should be integrated into the improved data license for iProX. In this study, with the implementation of a big data platform, iProX can support PB-level data storage, hundreds of billions of spectral records, and second-level latency service capabilities to meet the needs of the rapidly growing proteomics field. iProX has an important role in facilitating the analysis and sharing of proteomics data worldwide.

19 in total

1. PASSEL: the PeptideAtlas SRMexperiment library.

Authors: Terry Farrah; Eric W Deutsch; Richard Kreisberg; Zhi Sun; David S Campbell; Luis Mendoza; Ulrike Kusebauch; Mi-Youn Brusniak; Ruth Hüttenhain; Ralph Schiess; Nathalie Selevsek; Ruedi Aebersold; Robert L Moritz
Journal: Proteomics Date: 2012-04 Impact factor: 3.984

2. Biology: The big challenges of big data.

Authors: Vivien Marx
Journal: Nature Date: 2013-06-13 Impact factor: 49.962

3. Universal Spectrum Explorer: A Standalone (Web-)Application for Cross-Resource Spectrum Comparison.

Authors: Tobias Schmidt; Patroklos Samaras; Viktoria Dorfer; Christian Panse; Tobias Kockmann; Leon Bichmann; Bart van Puyvelde; Yasset Perez-Riverol; Eric W Deutsch; Bernhard Kuster; Mathias Wilhelm
Journal: J Proteome Res Date: 2021-05-10 Impact factor: 4.466

4. The application of Hadoop in structural bioinformatics.

Authors: Jamie J Alnasir; Hugh P Shanahan
Journal: Brief Bioinform Date: 2018-11-20 Impact factor: 11.622

5. Panorama Public: A Public Repository for Quantitative Data Sets Processed in Skyline.

Authors: Vagisha Sharma; Josh Eckels; Birgit Schilling; Christina Ludwig; Jacob D Jaffe; Michael J MacCoss; Brendan MacLean
Journal: Mol Cell Proteomics Date: 2018-02-27 Impact factor: 5.911

6. Enabling Massive XML-Based Biological Data Management in HBase.

Authors: Jian Liu; Qiuru Liu; Lei Zhang; Shuhui Su; Yongzhuang Liu
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2020-12-08 Impact factor: 3.710

7. Ethical Principles, Constraints and Opportunities in Clinical Proteomics.

Authors: Sebastian Porsdam Mann; Peter V Treit; Philipp E Geyer; Gilbert S Omenn; Matthias Mann
Journal: Mol Cell Proteomics Date: 2021-01-14 Impact factor: 5.911

8. ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Authors: Juan A Vizcaíno; Eric W Deutsch; Rui Wang; Attila Csordas; Florian Reisinger; Daniel Ríos; José A Dianes; Zhi Sun; Terry Farrah; Nuno Bandeira; Pierre-Alain Binz; Ioannis Xenarios; Martin Eisenacher; Gerhard Mayer; Laurent Gatto; Alex Campos; Robert J Chalkley; Hans-Joachim Kraus; Juan Pablo Albar; Salvador Martinez-Bartolomé; Rolf Apweiler; Gilbert S Omenn; Lennart Martens; Andrew R Jones; Henning Hermjakob
Journal: Nat Biotechnol Date: 2014-03 Impact factor: 54.908