
The PRIDE database resources in 2022: a hub for mass spectrometry-based proteomics evidences.

Yasset Perez-Riverol1, Jingwen Bai1, Chakradhar Bandla1, David García-Seisdedos1, Suresh Hewapathirana1, Selvakumar Kamatchinathan1, Deepti J Kundu1, Ananth Prakash1, Anika Frericks-Zipper2,3, Martin Eisenacher2,3, Mathias Walzer1, Shengbo Wang1, Alvis Brazma1, Juan Antonio Vizcaíno1.   

Abstract

The PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) is the world's largest data repository of mass spectrometry-based proteomics data. PRIDE is one of the founding members of the global ProteomeXchange (PX) consortium and an ELIXIR core data resource. In this manuscript, we summarize the developments in PRIDE resources and related tools since the previous update manuscript was published in Nucleic Acids Research in 2019. The number of submitted datasets to PRIDE Archive (the archival component of PRIDE) has reached on average around 500 datasets per month during 2021. In addition to continuous improvements in PRIDE Archive data pipelines and infrastructure, the PRIDE Spectra Archive has been developed to provide direct access to the submitted mass spectra using Universal Spectrum Identifiers. As a key point, the file format MAGE-TAB for proteomics has been developed to enable the improvement of sample metadata annotation. Additionally, the resource PRIDE Peptidome provides access to aggregated peptide/protein evidences across PRIDE Archive. Furthermore, we will describe how PRIDE has increased its efforts to reuse and disseminate high-quality proteomics data into other added-value resources such as UniProt, Ensembl and Expression Atlas.
© The Author(s) 2021. Published by Oxford University Press on behalf of Nucleic Acids Research.


Year:  2022        PMID: 34723319      PMCID: PMC8728295          DOI: 10.1093/nar/gkab1038

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Data sharing in the public domain has become the standard for proteomics researchers. Growth in recent years has been remarkable and, as a result, the number of proteomics datasets deposited every year in open public repositories is now comparable to that of transcriptomics (1). Since 2004, the PRoteomics IDEntifications (PRIDE) database (https://www.ebi.ac.uk/pride/) at the European Bioinformatics Institute (EMBL-EBI, Hinxton, Cambridge, UK) has enabled public data deposition of mass spectrometry (MS)-based proteomics data, providing access to the experimental data described in scientific publications (2). Since then, and especially in recent years, PRIDE Archive (the archival component of PRIDE) has become the largest repository for proteomics data sharing worldwide (2,3). PRIDE stores datasets coming from all proteomics experimental approaches, with a focus on discovery-driven techniques such as data-dependent acquisition (DDA) and data-independent acquisition (DIA) bottom-up proteomics, but also top-down proteomics and MS imaging, among others. For each dataset submitted to PRIDE Archive, the MS raw files (output files from the mass spectrometers) and the processed results (at least peptide/protein identification results; quantification information is optional) must be provided. In addition, each dataset in PRIDE Archive can contain peptide/protein quantitation result files, the mass spectra as peak list files, the searched protein sequence databases or spectral libraries, programming scripts, and any other technical and/or biological metadata provided by the data submitters (4). Within the Proteomics Standards Initiative (PSI) organization, the PRIDE team has led the creation and implementation of multiple standard open file formats such as mzTab (5), mzIdentML (6) and mzML (7) to store, process and visualize the deposited proteomics data.
The stand-alone ProteomeXchange (PX) Submission tool (8) allows researchers to submit data to PRIDE Archive, while PRIDE Inspector (9) enables users to review a dataset before, during, and after it has been deposited in the resource. After the submission is completed, different pipelines perform the validation and quality assessment of the reported results and store the data in multiple databases to enable data access and visualization in the PRIDE Archive web interface (https://www.ebi.ac.uk/pride/archive) and also programmatically via the PRIDE Application Programming Interface (API, https://www.ebi.ac.uk/pride/ws/archive/v2/). In recent years, PRIDE Archive has been moving its visualization components from desktop-based applications (e.g., PRIDE Inspector) to Restful APIs and web-based interfaces. All submitted files are available to download via FTP or the Aspera file transfer protocol. PRIDE resources have two main missions for the proteomics community: (i) support data deposition and quality assessment of submitted proteomics experiments, to help reproducible research; and (ii) promote and facilitate the reuse of public proteomics data, and disseminate high-quality proteomics evidences into added-value resources, including Ensembl (10), UniProt (11) and Expression Atlas (12). The PRIDE database was one of the founders of the PX consortium in 2011 (3,8). PX defines the guidelines for data submission and dissemination of public proteomics data worldwide. As of 2021, the other active members of the consortium are PeptideAtlas (13), including its related resource PASSEL (PeptideAtlas SRM Experiment Library) (14), MassIVE (15), jPOST (16), iProX (17) and Panorama Public (18).
PX coordinates the release of accession numbers for every submitted dataset and a set of services for providing unified access to publicly available datasets (http://proteomecentral.proteomexchange.org/cgi/GetDataset), including specific data types such as mass spectra, using Universal Spectrum Identifiers (19) (http://proteomecentral.proteomexchange.org/usi/). Additionally, in 2017, PRIDE became an ELIXIR (http://www.elixir-europe.org) core data resource (20) and ELIXIR deposition database, recognizing its key role in the life sciences. In this manuscript, we will summarize the main PRIDE-related developments in the last three years, since the previous Nucleic Acids Research (NAR) database update manuscript was published (2). We will discuss PRIDE Archive first but will also provide updated information about the PRIDE-related tools and other ongoing activities including the updates in the PRIDE Spectra Archive and PRIDE Peptidome. Additionally, we will also report about the work performed to disseminate and integrate proteomics data in other EMBL-EBI resources.

CURRENT STATUS OF THE PRIDE ECOSYSTEM: RESOURCES AND TOOLS

The PRIDE database ecosystem (https://www.ebi.ac.uk/pride/) is composed of a comprehensive set of libraries, desktop tools, databases, large-scale pipelines, Restful APIs and web applications (Figure 1). A set of open-source Java libraries, including jmzTab (21), jmzIdentML (22), ms-data-core-api (23) and the protein inference algorithms toolkit (PIA) (24,25), is supported and maintained by the PRIDE team and makes it possible to read, validate, process, and store proteomics data encoded in PSI open file formats. PRIDE Archive pipelines (2) perform a set of validation and quality checks to make sure that the deposited files are semantically valid and that the metadata provided during the submission are correct, in addition to moving the submitted datasets into the EMBL-EBI production filesystem.
Figure 1.

Schema of the PRIDE resources ecosystem. PRIDE Archive users must provide the raw files, the processed results files, and metadata about every given dataset. Standard file formats (for processed result files) can be provided for 'Complete' submissions. A group of open-source libraries is used by the PX Submission tool, and the PRIDE pipelines to validate, assess the quality of the reported peptides and proteins, and store the information (metadata, peptides/proteins and spectra) into multiple databases. The PRIDE Peptidome resource selects high-quality peptides across all the datasets in PRIDE Archive. All the data from PRIDE Archive and PRIDE Peptidome is served to external users such as Ensembl and UniProt through the PRIDE API and PRIDE web interface. Additionally, proteomics quantitative datasets are reanalyzed and integrated into Expression Atlas.

When a given dataset is made public, a group of post-submission pipelines parses the peptides and proteins identified in the dataset, provided it is a ‘complete’ submission (4), and indexes them into an Apache Solr and MongoDB-based infrastructure, which makes it possible to search datasets by the identified peptides and proteins. The PRIDE Spectra Archive and PRIDE Peptidome provide access to the mass spectra identified in the PRIDE Archive and to a condensed view of high-quality identified peptides across PRIDE Archive datasets, respectively. All data from PRIDE Archive and related resources are served through the PRIDE Restful API and the web application.

Data submission

The PRIDE Archive guidelines for data submission, including the required data files and metadata, have not changed substantially in recent years, in line with PX requirements. Previous publications (2,4) explain in detail the main formats supported, the types of submission (‘complete’ or ‘partial’), and the required metadata for each dataset. Complete submissions are those where the processed results are submitted in the PSI standard file formats mzIdentML or mzTab. A web tutorial explaining the main steps for data submission is available at https://www.ebi.ac.uk/training/online/courses/pride-quick-tour/. In 2019, complete submissions containing quantitative information based on the PRIDE XML file format were discontinued and replaced by mzTab-based complete submissions. mzTab (5) is a PSI tab-delimited format that supports the representation of not only identification results but also quantitative results and post-translational modification (PTM) localization information. Since 2019, Mascot (26), MaxQuant (27) and OpenMS (28) can export the resulting identification/quantification results into mzTab. Since 2020, 240 and 30 dataset submissions have been performed using mzTab generated from Mascot and MaxQuant, respectively. Recently, the MaxQuant and PRIDE teams worked together to enable the novel tool MaxDIA (29) to export results from DIA approaches to mzTab. Minor improvements have been made to the PX Submission tool, including performance improvements in the OLS Dialog (30) component, which allows searching for ontology/controlled vocabulary terms in the Ontology Lookup Service (https://www.ebi.ac.uk/ols/index). As a key point, file checksums are now computed during the submission and validated by the PRIDE pipelines to ensure the integrity of the submitted files.
Two additional improvements have been implemented as part of the submission process: (i) adding information about the dataset license; and (ii) submitting sample metadata and experimental design information using the newly developed file format MAGE-TAB for proteomics.
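The checksum step mentioned above (computed at submission, then validated by the pipelines) can be illustrated with a minimal sketch. The hash algorithm, the function names and the manifest layout are assumptions made for illustration; the text does not state which algorithm or file layout PRIDE actually uses.

```python
import hashlib


def file_checksum(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to bound memory use.

    The default algorithm is an assumption for this sketch, not PRIDE's documented choice.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest):
    """Given a hypothetical manifest {path: expected_hex}, report which files match."""
    return {path: file_checksum(path) == expected for path, expected in manifest.items()}
```

A submitter-side tool would record `file_checksum` for every file before upload, and the receiving side would call something like `verify_checksums` on the transferred copies; any `False` entry signals a corrupted or incomplete transfer.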

Datasets licenses

Licenses for datasets stored in PX resources had not been originally defined or agreed upon (3). In 2020, PX partners decided to move towards a default Creative Commons CC0 license as a minimum level for each dataset, making datasets globally available without any restrictions. PRIDE used to follow the EMBL-EBI ‘Terms of use’ (https://www.ebi.ac.uk/about/terms-of-use). The CC0 license can only be ensured for datasets newly submitted since 2020. It is expected that, for PRIDE, a CC0 license will be the default one in the foreseeable future, in line with the policy in other EMBL-EBI resources.

MAGE-TAB for proteomics: improving sample metadata and experimental design

For every dataset submitted to PRIDE Archive, general metadata about the study must be provided, including the title, submitters’ details, dataset description, sample and data protocols, instrument, and the associated publication once it is published (2,4,8). It has been highlighted multiple times (31–33) how the lack of appropriate metadata at the sample level, including the experimental design (e.g. sample treatment, fractionation steps, etc.), prevents a more streamlined reuse of the available data, especially in the case of reanalyses of quantitative proteomics datasets. MAGE-TAB for proteomics (34), an extension of the original MAGE-TAB format used in transcriptomics (35), has recently been proposed to capture the sample metadata and the experimental design of proteomics experiments (Figure 2).
Figure 2.

PRIDE Archive users can now provide SDRF-Proteomics files to represent the experimental design and the relationship between the samples analyzed and the instrument raw files. The samples included in the SDRF-Proteomics files are submitted to BioSamples getting each of them a unique accession number. In addition, the PRIDE web interface represents the information contained in SDRF-Proteomics files in an ‘Experimental Design’ table, including all samples and data files.

MAGE-TAB for proteomics has two main components: the Investigation Description Format (IDF) and the Sample and Data Relationship Format (SDRF). The IDF contains the general description of the study, which is the same information annotated with the PX Submission tool, so users do not need to provide it upon submission. The SDRF-Proteomics format includes the representation of the experimental design, and the relationship between the samples analyzed in the experiment and the MS data files (raw files). SDRF-Proteomics is a tab-delimited format where each column is a property of the sample or the data file. Each row corresponds to the relation between a sample and a data file, and each cell is the value of the property for the sample or the data file (34) (https://github.com/bigbio/proteomics-metadata-standard). SDRF-Proteomics files can now be added manually by the user by selecting ‘EXPERIMENTAL DESIGN’ as the file type during the submission. Once the data arrive at PRIDE, a BioSamples database accession is requested for each sample, which is added to the BioSamples resource (36) (e.g. https://www.ebi.ac.uk/biosamples/samples/SAMEA7710319) via the PRIDE Archive pipelines. In addition, the corresponding experimental design table (e.g. https://www.ebi.ac.uk/pride/archive/projects/PXD000792) (Figure 2) can be accessed through the PRIDE Archive web interface.
As of September 2021, more than 130 public datasets have been re-annotated by third parties (33) and the resulting information is available via PRIDE Archive (https://www.ebi.ac.uk/pride/archive?keyword=sdrf.tsv).
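The row/column layout of SDRF-Proteomics described above (one row per sample-to-data-file relation, one column per property) can be read with a standard tab-delimited parser. The following is a minimal sketch; the column names follow the naming style of the SDRF-Proteomics specification as far as we are aware, but the example content and the helper function are invented for illustration.

```python
import csv
import io

# Illustrative SDRF-like content; sample names, runs and file names are made up.
EXAMPLE_SDRF = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun 1\tsample1_f1.raw\n"
    "sample 1\tHomo sapiens\trun 2\tsample1_f2.raw\n"
    "sample 2\tHomo sapiens\trun 3\tsample2_f1.raw\n"
)


def files_per_sample(sdrf_text, sample_column="source name",
                     file_column="comment[data file]"):
    """Group data files by sample; each row links one sample to one data file."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    grouped = {}
    for row in reader:
        grouped.setdefault(row[sample_column], []).append(row[file_column])
    return grouped
```

Calling `files_per_sample(EXAMPLE_SDRF)` groups the two fraction files under `sample 1` and the single file under `sample 2`, which is the kind of sample-to-file mapping that a reanalysis pipeline needs before quantification.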

PRIDE Archive web interface and Restful API: accessing proteomics evidences

The PRIDE Restful API (https://www.ebi.ac.uk/pride/ws/archive/v2/) can be used to query and access all the data in PRIDE resources. By using the API it is possible, for example, to query and find datasets by their date of publication, the proteins that have been identified, or the name of a data file within the study (e.g., https://www.ebi.ac.uk/pride/ws/archive/v2/search/projects?keyword=Subject1_FACS145_B_C10). A powerful query language allows users to combine multiple keywords (properties of the project) into an SQL-based query to search datasets. A Python package and tool (https://github.com/PRIDE-Archive/pridepy) have been developed to interact programmatically with the PRIDE Archive Restful API. The package provides a data model for all the data structures provided by the API and also includes functionality to query each endpoint in the API (see https://github.com/PRIDE-Archive/pridepy#examples). The PRIDE Archive web interface provides visualization components that allow users to search, find and inspect all the dataset information. A large number of the features from PRIDE Inspector have been moved into the PRIDE web interface, enabling the inspection of the peptide/protein evidences and the spectra identified in each complete submission (Figure 3). In the results exploration viewer, users can explore the identification results, including the sequence coverage of the identified proteins and the mass spectra that are part of each PSM (Peptide Spectrum Match) (Figure 3, https://www.ebi.ac.uk/pride/archive/projects/PXD008613/results?reportedAccession=SPTB2_HUMAN&assayAccession=83415). It is important to highlight that these features are only available for complete submissions.
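As an illustration of the keyword search shown in the example URL above, the following sketch builds the same kind of request with the Python standard library. Only the base URL, the `search/projects` path and the `keyword` parameter are taken from the text; the helper name and any additional query parameters passed through it are hypothetical, and the structure of the response is not described here.

```python
from urllib.parse import urlencode

# Base URL as given in the text for version 2 of the PRIDE Archive API.
API_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"


def project_search_url(keyword, **extra_params):
    """Build a project search URL; extra_params are hypothetical pass-through options."""
    params = {"keyword": keyword, **extra_params}
    return f"{API_BASE}/search/projects?{urlencode(params)}"
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen`; for anything beyond simple lookups, the pridepy package mentioned above is likely the more convenient route.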
Figure 3.

The PRIDE web interface provides functionality to assess the quality of each Complete submission, including components to: (A) visualize the sequence coverage of a particular protein; and (B) visualize the spectrum used to identify a given peptide.


PRIDE Spectra Archive: accessing and visualizing all spectra for complete submissions

The public availability of, and direct access to, mass spectra create the opportunity for scientists to directly assess whether, e.g., a novel peptide evidence, PTM, or single amino acid variant (SAAV) is supported by a good-quality and well-annotated mass spectrum (19,37). PSI and PX partners have recently created a novel mechanism to uniquely resolve each mass spectrum in public proteomics resources. The Universal Spectrum Identifier (USI) enables greater transparency of spectral evidence, making it more ‘FAIR’ (Findable, Accessible, Interoperable, and Reusable), with more than 1 billion USI identifications from over 3 billion spectra already available through PX repositories (19). The PRIDE Spectra Archive (https://www.ebi.ac.uk/pride/archive/spectra) provides access to over 540 million PSMs (as of September 2021) originally submitted to PRIDE Archive. Users can search by peptide sequences and USIs, enabling them to find specific PSMs from complete submitted datasets. A list of PSMs is shown after the search, including peptide sequences, PTMs, search engine scores, charges, and two additional columns that indicate whether the PSM has passed the original analysis threshold and the PRIDE internal pipeline thresholds (for example, PSM false discovery rate (FDR) <0.1 computed using the PIA algorithm (24,25)). The accession column in the result table provides a direct link to the project result page, where users can check all the results for a given dataset.
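To make the idea of a USI more concrete, the sketch below splits an identifier into its parts. The field layout (`mzspec:` prefix, dataset collection, MS run, index type, index value and an optional interpretation) reflects our understanding of the published USI specification (19) and is not spelled out in this text; the class and function names are hypothetical, and run names that themselves contain colons are not handled.

```python
from typing import NamedTuple, Optional


class USI(NamedTuple):
    collection: str               # e.g. a ProteomeXchange accession
    ms_run: str                   # name of the MS run (raw file without extension)
    index_type: str               # e.g. "scan"
    index: str                    # index value within the run
    interpretation: Optional[str] # peptide/charge annotation, if present


def parse_usi(usi):
    """Split a USI string into its colon-separated fields (simplified)."""
    parts = usi.split(":", 5)
    if len(parts) < 5 or parts[0] != "mzspec":
        raise ValueError(f"not a USI: {usi!r}")
    interpretation = parts[5] if len(parts) == 6 else None
    return USI(parts[1], parts[2], parts[3], parts[4], interpretation)
```

For instance, `parse_usi("mzspec:PXD000561:Adult_Frontalcortex_bRP_Elite_85_f09:scan:17555:VLHPLEGAVVIIFK/2")` separates the dataset accession, the run name, the scan number and the peptide/charge interpretation, which is the information needed to locate a single spectrum in a repository.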

PRIDE Peptidome: a condensed view of peptide evidences across PRIDE Archive

PRIDE Peptidome (https://www.ebi.ac.uk/pride/peptidome/) is a resource that groups all PSMs by peptide sequence and the corresponding protein accession. Until recently, the grouping was performed using a spectrum clustering approach (38). However, this approach presented major challenges because every spectrum had to be compared with every other one, which caused performance problems given the continuous and remarkable growth in the amount of submitted data. Although spectrum clustering algorithms have recently been improved using deep-learning models to avoid all-against-all comparisons between the spectra (39,40), these novel algorithms have not yet been applied in large-scale data repositories. Instead of spectrum clustering, a novel platform and algorithm (https://github.com/bigbio/sparkms) have been used to select the best peptide evidence for each peptide and protein combination. The best peptide is selected based on two rules: (i) the peptide passes the peptide FDR threshold for the assay; and (ii) the peptide sequence is longer than seven amino acids. sparkMS used Spark (https://spark.apache.org/) and PySpark to group millions of PSMs in less than 6 hours, which made the analysis of such a large amount of data feasible. The PRIDE Peptidome web interface enables users to search by peptide sequence and protein accession numbers (e.g. https://www.ebi.ac.uk/pride/peptidome/peptidesearch?keyword=SPTB2_HUMAN). The search table shows the sequence for each peptide, protein accession, the number of PSMs across PRIDE Archive, the number of datasets where this peptide has been identified and the best posterior error probability (PEP), as computed by PIA (25).
When a given peptide-protein combination is selected, the peptide viewer shows the sequence, the spectrum that justifies the best scored PSM, the list of all PTMs identified, and the corresponding tissues and diseases where the peptide was identified (e.g. https://www.ebi.ac.uk/pride/peptidome/peptidedetails?keyword=DASVAEAWLLGQEPYLSSR&proteinAccession=SPTB2_HUMAN).
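The two selection rules and the grouping by peptide and protein described above can be sketched in plain Python as follows. This is not the sparkMS implementation: the record fields (`sequence`, `protein`, `pep`, `passes_peptide_fdr`) are hypothetical, and ranking the retained PSMs by lowest PEP is an assumption suggested by the "best PEP" column of the search table.

```python
from collections import defaultdict

MIN_LENGTH = 7  # rule (ii): the sequence must be longer than seven amino acids


def best_peptide_evidence(psms):
    """Group PSMs by (peptide, protein) and keep the best-scoring one per group.

    Each PSM is a dict with hypothetical keys: sequence, protein, pep,
    passes_peptide_fdr. Lower PEP is treated as better (an assumption).
    """
    grouped = defaultdict(list)
    for psm in psms:
        # Rule (i): peptide FDR threshold; rule (ii): minimum sequence length.
        if psm["passes_peptide_fdr"] and len(psm["sequence"]) > MIN_LENGTH:
            grouped[(psm["sequence"], psm["protein"])].append(psm)
    return {
        key: {"n_psms": len(group), "best": min(group, key=lambda p: p["pep"])}
        for key, group in grouped.items()
    }
```

In a Spark setting the same logic maps naturally onto a filter followed by a group-by on the (sequence, protein) key, which avoids the pairwise spectrum comparisons that made clustering expensive.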

PRIDE ARCHIVE SUBMISSION STATISTICS

As of 1 August 2021, PRIDE Archive stored 23 168 datasets, compared to the 10 100 datasets available in August 2018 (2), which means that 56.4% of the datasets in PRIDE Archive have been submitted in the last 3 years. Figure 4 shows the distribution of submissions by month, species, and disease in PRIDE Archive since 2012, and the cumulative size of PRIDE Archive data in terabytes.
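The 56.4% figure follows directly from the two dataset counts quoted above; the short calculation below reproduces it (the function name is ours).

```python
def share_submitted_since(total_now, total_before):
    """Percentage of the current total that was added after the earlier count."""
    return 100 * (total_now - total_before) / total_now


# Counts quoted in the text: 23 168 datasets (August 2021), 10 100 (August 2018).
new_datasets = 23168 - 10100
share = share_submitted_since(23168, 10100)
```

Here `new_datasets` is 13 068 and `share` rounds to 56.4.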
Figure 4.

(A) Number of submitted datasets to PRIDE Archive per month (from the beginning of PX in 2012 till August 2021); (B) cumulative size of PRIDE Archive data since 2012; (C) number of submitted datasets per species or taxonomy identifier (as of August 2021). All species that had less than 100 datasets are grouped in one category; (D) distribution of the number of submitted datasets to PRIDE Archive per annotated disease.

In 2019, PRIDE Archive received 314 datasets per month on average, rising to 436 during 2020 and, so far in 2021, to 499 (Figure 4A), which reflects the growing demand for and prominence of PRIDE. At the time of writing, PRIDE hosts ∼83% of all PX datasets, coming from >8 000 research groups in 66 countries. The proportion of submitted datasets that are now publicly available is currently 64%, an improvement of around 8% compared with 2019. To increase this proportion further, the team has developed multiple mechanisms to detect datasets that have already been published but whose publication has not been reported to PRIDE by the original submitters. As a concrete example, submitters can report via the PRIDE web interface that a dataset that is still private already has a corresponding published manuscript. The size of PRIDE Archive data has doubled from 2019 to 2021 (Figure 4B). As a result, PRIDE Archive is the third-largest omics archive at EMBL-EBI, exceeded only by the genomics resources ENA (European Nucleotide Archive) and EGA (European Genome-phenome Archive) (41). As of September 2021, the largest share of the data in PRIDE Archive (including both public and private datasets) corresponds to human datasets (including cell lines) (39.1%), followed by mouse (13.7%), Saccharomyces cerevisiae (2.8%), Arabidopsis thaliana (2.7%), Rattus norvegicus (2.5%) and Escherichia coli (2.3%).
Whereas most of the datasets come from model organisms, datasets from >3 224 different taxonomy identifiers are stored in PRIDE Archive overall (Supplementary File S1). The distribution of submitted datasets by tissue and disease is more heterogeneous (Figure 4C and D), with ‘cell-culture (non-specific tissue)’ and ‘disease-free (healthy/normal samples)’ being the predominant annotations. Altogether, cancer is the most studied disease, followed by Alzheimer's and Parkinson's disease. Importantly, as of September 2021, more than 180 COVID-19 related datasets have been submitted to PRIDE Archive. These datasets, once they become publicly available, are integrated into the EMBL-EBI resource COVID-19 Data Portal (https://www.covid19dataportal.org/), enabling researchers to access all public data at EMBL-EBI resources in a unified interface (42).

PRIDE ARCHIVE AS A HUB OF MS EVIDENCES

Proteomics researchers are increasingly reusing public data from PRIDE (and other PX resources) for a broad range of purposes. For instance, recent resources that have been started by reusing mostly PRIDE public datasets include OpenProt (43), MatrisomeDB (44), Scop3P (45) and ProteomeHD (46). Additionally, as just one among many examples of high-profile data reuse, PRIDE datasets are routinely reanalyzed in the context of the Human Proteome Project (47). Figure 5A shows the increase in volumes of data downloaded from PRIDE Archive since 2013. Recently, PRIDE has started to track the reuse of public PRIDE datasets in publications. This information (if applicable) is available in the dataset web page when clicking on the term ‘Dataset reuses’. Figure 5B shows the increase in manuscripts (including pre-prints) published per year, where PRIDE datasets are reused.
Figure 5.

(A) Volumes of PRIDE Archive data downloads per year, from 2013 to 2020. (B) Number of manuscripts (including pre-prints) per year (2013–2021), where datasets from PRIDE Archive are reused. The figures from 2021 are estimated at the end of the year, according to the existing data at the end of September. It should be noted that the figures represent an underestimation since they only include those manuscripts that could be tracked successfully.

For sustainability reasons, our in-house focus has been on disseminating and integrating PRIDE proteomics data into added-value EMBL-EBI resources such as UniProt (11), Ensembl (10), and Expression Atlas (12), rather than on creating new resources. Additionally, we have just taken the first steps in the work required to disseminate and integrate metaproteomics data into MGnify (48), an EMBL-EBI resource for the analysis, archiving, and browsing of metagenomic and metatranscriptomic data. The dissemination of public proteomics data has different goals depending on each specific resource, but these can be grouped into three main categories: (i) provide aggregated peptide/protein evidences as originally submitted to PRIDE Archive, in the case of UniProt and Ensembl; (ii) provide peptide/protein evidences, variant sequences and PTM information from reanalyzed datasets to UniProt, Ensembl and, in the near future, MGnify, in which case an open analysis pipeline with well-defined quality control metrics is used; and (iii) provide quantitative protein expression information to Expression Atlas, using data coming from reanalyzed datasets.

In-house data reuse: proteogenomics reanalysis integration with Ensembl

Since 2019, PRIDE has started to provide peptide evidences to Ensembl using the ‘TrackHub’ registry (2). More than 4 million canonical peptide sequences, coming from 184 PRIDE public datasets, have been disseminated into Ensembl ‘TrackHubs’ which are available at https://ftp.pride.ebi.ac.uk/pride/data/proteogenomics/latest/archive/. Some obvious benefits of integrating genomics and proteomics data in genome browsers include linking somatic variants and MS evidences and/or gene sequences and PTMs. Recently, we developed a group of tools and workflows to enable large-scale reanalysis of public proteomics data to identify non-canonical peptides (49). Using custom proteogenomics databases created with pgdb (https://github.com/nf-core/pgdb) and the pypgatk (https://github.com/bigbio/py-pgatk) we have managed to identify 43 501 non-canonical peptides and 786 variant peptide sequences in four public datasets.

In-house data reuse: data dissemination into UniProt

Aggregated high-quality evidences (as submitted to PRIDE Archive) are linked to UniProt, enabling users to check whether a particular protein has been detected in PRIDE Archive. As part of an ongoing effort, we are currently aiming to link all peptide evidences from PRIDE Peptidome to populate the UniProt ProtVista viewer (50). Additionally, we are currently working on the development of infrastructure to reliably reanalyse, store, visualize and disseminate PTM data (starting with phosphorylation) from PRIDE into UniProt. This is taking place in the context of the ‘PTMeXchange’ project, in collaboration with the PeptideAtlas team and the University of Liverpool. Prior to this more systematic effort, we reanalysed 112 human phospho-enriched datasets, generated from 104 different human cell types or tissues (51). Using a machine learning approach, some of the information generated by the reanalysis, together with other sequence features, was used to create a single functional score for human phosphosites.

In-house data reuse: integration of quantitative analyses in Expression Atlas

More than 65 quantitative datasets have been annotated and reanalysed, and the corresponding results have already been, or are being, integrated into Expression Atlas at the time of writing. Most of them are DDA label-free datasets, involving cell lines and tumor samples (52), and baseline tissue datasets coming from human, mouse and rat samples. MaxQuant was used as the analysis software in all cases. Additionally, ten SWATH-MS DIA datasets coming mainly from cell line and human tumor samples have also been reanalysed and integrated into Expression Atlas. In this case, an in-house open analysis pipeline based on OpenSWATH (https://github.com/PRIDE-reanalysis/DIA-reanalysis) was developed and used for the reanalysis (53). These datasets constituted a pilot project to study the feasibility of performing a systematic reanalysis and integration of DIA datasets. Expression Atlas users can now access more comprehensive proteomics expression information in the same interface as gene expression, which provides an effective way of integrating the results of transcriptomics and proteomics experiments.

DISCUSSION AND FUTURE PLANS

Data deposition and dissemination have changed the proteomics community since the creation of PX almost 10 years ago. Most proteomics journals now require authors to deposit their data in a PX resource, which has improved the reproducibility and traceability of the claims reported in a given manuscript. The proteomics community is now widely embracing open data policies, the opposite of the situation just a few years ago. At the same time, public proteomics data are being increasingly reused in multiple applications (1).

We next outline some of the main working areas for PRIDE in the near future. First of all, PRIDE is raising the bar of metadata annotation for all submitted datasets. MAGE-TAB for proteomics has been created with the aim that every submitted dataset provides information about the sample and the experimental design. The improved annotation is also required to facilitate further data reuse by third parties. We expect that, gradually, the SDRF-Proteomics component will become mandatory for every dataset submission, once the community has become familiar with the file format and with the mandatory information that needs to be provided. Multiple materials (https://github.com/bigbio/proteomics-metadata-standard/wiki), including examples and video tutorials, have been made available to help users understand the file format and how it can be submitted to PRIDE Archive.

With the growing importance of clinical proteomics, for example in the context of multi-omics studies, another important area is the management of clinically sensitive human proteomics data. Ethical issues in proteomics are starting to be discussed and are becoming increasingly relevant. A community-driven white paper on the topic, describing the current state of the art, has recently been published (54). Addressing ethical issues for genomics and transcriptomics data led to processes that control who may access the data, so-called ‘controlled access’.
Resources supporting the storage and dissemination of controlled access DNA/RNA sequencing datasets include the EGA and others internationally, such as dbGAP (USA) and the Japanese Genotype-phenotype Archive. At present, all data in PRIDE (and in all PX resources) are fully open. Therefore, there is an increasing number of clinically sensitive human datasets that cannot be made available via PRIDE due to ethics-related issues (55). To address this problem, we will work on developing a tailored infrastructure for sensitive human proteomics data, and on all the related policies.

Additionally, in the context of data archiving activities, we plan to improve the support for cross-linking data, as outlined in (56), and to provide better data integration for structural proteomics datasets between PRIDE Archive and the Protein Data Bank (PDB). As shown above, we are already developing open and reproducible data analysis pipelines for different flavours of proteomics workflows (e.g., DDA, DIA, proteogenomics) (49,53,57). The main rationale is to enable that software to run in cloud infrastructures, so that in the future the community can use the pipelines in the cloud through software container technologies (58). In addition, we aim to increasingly perform in-house data reuse (including data reanalysis) and to disseminate high-quality proteomics data from PRIDE into the already mentioned added-value resources (Ensembl, UniProt, Expression Atlas, and MGnify in the near future). In this context, we will also work on improving the PRIDE Archive infrastructure to store dataset reanalyses appropriately, linking them to the relevant resources. One aim is to further develop data dissemination and integration practices that also involve resources outside of EMBL-EBI.

Finally, we invite parties interested in PRIDE-related developments to follow the PRIDE Twitter account (@pride_ebi).
For regular announcements of all the new publicly available datasets, users can follow the PX Twitter account (@proteomexchange).
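
The discussion above states that every submitted dataset should describe its samples and experimental design in a tab-delimited metadata file. As a minimal sketch of what checking such a file could look like, the Python snippet below reads a tab-delimited table and reports required columns that are missing and rows with empty required cells. The column names in `REQUIRED_COLUMNS`, the function `check_sdrf` and the example table are assumptions made for illustration only. The authoritative list of mandatory fields is the one given in the SDRF-Proteomics specification, and this is not the validation used by PRIDE Archive.

```python
import csv
import io

# Hypothetical minimal set of columns for this sketch; the authoritative list
# is defined by the SDRF-Proteomics specification, not by this example.
REQUIRED_COLUMNS = (
    "source name",
    "characteristics[organism]",
    "assay name",
    "comment[data file]",
)


def check_sdrf(text, required=REQUIRED_COLUMNS):
    """Return (missing_columns, incomplete_rows) for a tab-delimited table.

    missing_columns: required column names absent from the header.
    incomplete_rows: 1-based data-row numbers with an empty required cell.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    if not rows:
        return list(required), []
    # Header names are compared case-insensitively, ignoring outer whitespace.
    header = [name.strip().lower() for name in rows[0]]
    missing = [col for col in required if col not in header]
    present = [header.index(col) for col in required if col in header]
    incomplete = []
    for number, row in enumerate(rows[1:], start=1):
        # A row is incomplete if a required cell is absent or blank.
        if any(i >= len(row) or not row[i].strip() for i in present):
            incomplete.append(number)
    return missing, incomplete


# Made-up example table: the second sample lacks an organism annotation.
EXAMPLE = (
    "source name\tcharacteristics[organism]\tassay name\tcomment[data file]\n"
    "sample 1\tHomo sapiens\trun 1\tsample1.raw\n"
    "sample 2\t\trun 2\tsample2.raw\n"
)

if __name__ == "__main__":
    print(check_sdrf(EXAMPLE))
```

For the made-up table, no required column is missing and the second data row is flagged because its organism cell is empty. A real submission would need to be checked against the full specification rather than this reduced column set.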
  57 in total

1.  PIA: An Intuitive Protein Inference Engine with a Web-Based User Interface.

Authors:  Julian Uszkoreit; Alexandra Maerkens; Yasset Perez-Riverol; Helmut E Meyer; Katrin Marcus; Christian Stephan; Oliver Kohlbacher; Martin Eisenacher
Journal:  J Proteome Res       Date:  2015-06-10       Impact factor: 4.466

2.  Universal Spectrum Explorer: A Standalone (Web-)Application for Cross-Resource Spectrum Comparison.

Authors:  Tobias Schmidt; Patroklos Samaras; Viktoria Dorfer; Christian Panse; Tobias Kockmann; Leon Bichmann; Bart van Puyvelde; Yasset Perez-Riverol; Eric W Deutsch; Bernhard Kuster; Mathias Wilhelm
Journal:  J Proteome Res       Date:  2021-05-10       Impact factor: 4.466

3.  Scop3P: A Comprehensive Resource of Human Phosphosites within Their Full Context.

Authors:  Pathmanaban Ramasamy; Demet Turan; Natalia Tichshenko; Niels Hulstaert; Elien Vandermarliere; Wim Vranken; Lennart Martens
Journal:  J Proteome Res       Date:  2020-06-18       Impact factor: 4.466

4.  Co-regulation map of the human proteome enables identification of protein functions.

Authors:  Georg Kustatscher; Piotr Grabowski; Tina A Schrader; Josiah B Passmore; Michael Schrader; Juri Rappsilber
Journal:  Nat Biotechnol       Date:  2019-11-04       Impact factor: 68.164

5.  MaxDIA enables library-based and library-free data-independent acquisition proteomics.

Authors:  Pavel Sinitcyn; Hamid Hamzeiy; Favio Salinas Soto; Daniel Itzhak; Frank McCarthy; Christoph Wichmann; Martin Steger; Uli Ohmayer; Ute Distler; Stephanie Kaspar-Schoenefeld; Nikita Prianichnikov; Şule Yılmaz; Jan Daniel Rudolph; Stefan Tenzer; Yasset Perez-Riverol; Nagarjuna Nagaraj; Sean J Humphrey; Jürgen Cox
Journal:  Nat Biotechnol       Date:  2021-07-08       Impact factor: 68.164

6.  The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience.

Authors:  Johannes Griss; Andrew R Jones; Timo Sachsenberg; Mathias Walzer; Laurent Gatto; Jürgen Hartler; Gerhard G Thallinger; Reza M Salek; Christoph Steinbeck; Nadin Neuhauser; Jürgen Cox; Steffen Neumann; Jun Fan; Florian Reisinger; Qing-Wei Xu; Noemi Del Toro; Yasset Pérez-Riverol; Fawaz Ghali; Nuno Bandeira; Ioannis Xenarios; Oliver Kohlbacher; Juan Antonio Vizcaíno; Henning Hermjakob
Journal:  Mol Cell Proteomics       Date:  2014-06-30       Impact factor: 5.911

7.  ProtVista: visualization of protein sequence annotations.

Authors:  Xavier Watkins; Leyla J Garcia; Sangya Pundir; Maria J Martin
Journal:  Bioinformatics       Date:  2017-07-01       Impact factor: 6.937

8.  Ensembl 2020.

Authors:  Andrew D Yates; Premanand Achuthan; Wasiu Akanni; James Allen; Jamie Allen; Jorge Alvarez-Jarreta; M Ridwan Amode; Irina M Armean; Andrey G Azov; Ruth Bennett; Jyothish Bhai; Konstantinos Billis; Sanjay Boddu; José Carlos Marugán; Carla Cummins; Claire Davidson; Kamalkumar Dodiya; Reham Fatima; Astrid Gall; Carlos Garcia Giron; Laurent Gil; Tiago Grego; Leanne Haggerty; Erin Haskell; Thibaut Hourlier; Osagie G Izuogu; Sophie H Janacek; Thomas Juettemann; Mike Kay; Ilias Lavidas; Tuan Le; Diana Lemos; Jose Gonzalez Martinez; Thomas Maurel; Mark McDowall; Aoife McMahon; Shamika Mohanan; Benjamin Moore; Michael Nuhn; Denye N Oheh; Anne Parker; Andrew Parton; Mateus Patricio; Manoj Pandian Sakthivel; Ahamed Imran Abdul Salam; Bianca M Schmitt; Helen Schuilenburg; Dan Sheppard; Mira Sycheva; Marek Szuba; Kieron Taylor; Anja Thormann; Glen Threadgold; Alessandro Vullo; Brandon Walts; Andrea Winterbottom; Amonida Zadissa; Marc Chakiachvili; Bethany Flint; Adam Frankish; Sarah E Hunt; Garth IIsley; Myrto Kostadima; Nick Langridge; Jane E Loveland; Fergal J Martin; Joannella Morales; Jonathan M Mudge; Matthieu Muffato; Emily Perry; Magali Ruffier; Stephen J Trevanion; Fiona Cunningham; Kevin L Howe; Daniel R Zerbino; Paul Flicek
Journal:  Nucleic Acids Res       Date:  2020-01-08       Impact factor: 16.971

9.  ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Authors:  Juan A Vizcaíno; Eric W Deutsch; Rui Wang; Attila Csordas; Florian Reisinger; Daniel Ríos; José A Dianes; Zhi Sun; Terry Farrah; Nuno Bandeira; Pierre-Alain Binz; Ioannis Xenarios; Martin Eisenacher; Gerhard Mayer; Laurent Gatto; Alex Campos; Robert J Chalkley; Hans-Joachim Kraus; Juan Pablo Albar; Salvador Martinez-Bartolomé; Rolf Apweiler; Gilbert S Omenn; Lennart Martens; Andrew R Jones; Henning Hermjakob
Journal:  Nat Biotechnol       Date:  2014-03       Impact factor: 54.908

10.  jmzTab: a java interface to the mzTab data standard.

Authors:  Qing-Wei Xu; Johannes Griss; Rui Wang; Andrew R Jones; Henning Hermjakob; Juan Antonio Vizcaíno
Journal:  Proteomics       Date:  2014-04-29       Impact factor: 3.984
