Public proteomics data: How the field has evolved from sceptical inquiry to the promise of in silico proteomics.

Keywords: ESF, European Science Foundation; GPMDB, Global Proteome Machine Database; MIAPE, minimal information about a proteomics experiment; PRIDE, Proteomics Identifications Database

Year: 2016 PMID: 29900110 PMCID: PMC5988554 DOI: 10.1016/j.euprot.2016.02.005

Source DB: PubMed Journal: EuPA Open Proteom ISSN: 2212-9685

At the turn of the millennium, the analysis of proteins in a sample by mass spectrometry experienced a series of technology developments that greatly advanced its analytic reach and power [1]. The key improvements were peptide-centric methods, which focused on peptides as the primary analyte rather than the much more diverse and cumbersome proteins that are the biological entity of interest, and ever faster cycling and increasingly precise mass spectrometers, which allowed the highly complex peptide samples to be analysed sufficiently quickly and accurately [2]. It is, however, worth noting that these developments were strongly supported by already established bioinformatics infrastructure, most notably the availability of reliable protein sequence databases [3], and of automated search engines that could match an experimental fragmentation spectrum to a peptide sequence obtained after proteolytic digest of these protein sequence databases [4]. Indeed, without these search engines or the databases they rely on, the field would have been incapable of handling the vast amounts of data generated by the new approaches and instruments. Importantly, the much higher throughput achieved from roughly the year 2000 onwards brought increasing scrutiny of the results obtained. While few people questioned the data themselves (exhaustive quality control of the data has only recently become a topic of focused interest, see below), many researchers started to wonder about the reliability of the hundreds, and then thousands, of peptide identifications that appeared at the end of each analytical run. In response, several papers came out in rapid succession, seeking to understand the behaviour of the existing algorithms on these large bodies of data, and looking for ways to separate correct identifications from spurious ones [5], [6], [7].
At the same time, the central importance of search engines in shotgun proteomics was further confirmed by the publication of several additional algorithms such as OMSSA [8] and X!Tandem [9]. These tools added to a growing repertoire of software that could be used to provide ever more sophisticated analyses of the acquired data. However, despite the increasing sophistication of proteomics techniques and identification software, the ever-growing number of identifications obtained from a single run was received quite sceptically, even within the field itself. On the other hand, it also became clear that the wealth of data generated was direly in need of standardization, structured management, and dissemination for re-use. In order to address these two seemingly independent issues, data validation on the one hand, and data management and dissemination on the other, Prince et al. [10] were the first to state the need for a public proteomics data repository. In their paper, they also introduced an open, online system for storing and sharing proteomics data files. Almost simultaneously, Craig et al. published the Global Proteome Machine Database (GPMDB) system, which consisted of a complete data processing pipeline based on the X!Tandem search engine, connected to a relational database to house the results [11]. The next year, Desiere et al. [12] published the PeptideAtlas system, which also featured a processing pipeline feeding into a database, while my collaborators and I published the Proteomics Identifications Database (PRIDE) as a submission-driven, structured data repository [13] (see Supplementary Material for the original grant application to develop the PRIDE database, submitted in June 2003).
While all very similar in underlying concept, these four efforts at building a repository had different goals at the start. On the one hand, the system developed by Prince et al. shared the true repository focus of PRIDE, with both systems intended for the accurate dissemination of original data and associated results. On the other hand, PeptideAtlas and GPMDB were more focused on re-use of the data from the start. PeptideAtlas placed a strong focus on re-using the reprocessed public data as a means to annotate the (human) genome, while the GPMDB data were used to discover proteotypic peptides [14] and to build spectral libraries [15]. Since then, several additional repositories have been developed [16], [17], [18], and some have also been lost again [19], [20]. The key databases have, however, been unified under the ProteomeXchange umbrella, allowing data submission and retrieval to be carried out in a clearly delineated, straightforward way [21]. It is particularly worth noting that all repositories have embraced the concept of data re-use that was so central to the existence of PeptideAtlas and the GPMDB from their inception. For instance, the PRIDE repository now features its own in-house generated spectral libraries [22], and data re-use has enabled researchers to perform an indirect type of crowd-sourcing of data from across the entire proteomics field [23]. Tools and web services to access these online data, such as PRIDE Inspector [24], pride-asap [25], the PRIDE REST service [26] and PeptideShaker [27], have also substantially lowered the threshold to data re-use, allowing any interested user to explore the publicly available data in any way imaginable. The specific role of proteomics as a genome annotation source has also matured over the years, with UniProt listing cross-references to, and annotations from, PeptideAtlas and PRIDE, amongst many other sources. Moreover, dedicated analysis pipelines [28] have recently enabled proteomics data from PRIDE to serve as direct annotation sources for databases describing novel genome features such as long non-coding RNAs [29], [30] and small open reading frames [31].
Indeed, the re-analysis of these publicly available proteomics data now attracts substantial research efforts, and this trend is likely to increase as ever more possible forms of data re-use are put in place (see Vaudel et al. [32] for a review of the possibilities and opportunities). It should be noted, however, that the key issue that hampers proteomics data re-use is the lack of sufficient metadata reported along with the original data and results. Indeed, despite the early formulation of the necessary minimal reporting requirements in the form of the Minimal Information About a Proteomics Experiment (MIAPE) [33] and the development of MIAPE-ready standard data formats (notably mzML [34], mzIdentML [35], and mzQuantML [36]), the level of annotation of public data sets remains suboptimal [37]. Nevertheless, curatorial efforts at the PRIDE database (the most widely used point of submission in the ProteomeXchange consortium) have helped increase the level of core annotation substantially [37]. It is expected that further automation of data submission pipelines (starting from PRIDE Converter in 2009 [38], PRIDE Converter 2 in 2012 [39], and supplemented with the ProteomeXchange submission tool in 2014 [21]) will also make it ever easier for submitters to provide all relevant information along with their original data and results. The future for public proteomics data dissemination is certainly bright, especially because data sharing is strongly encouraged, and increasingly even mandated, by important funders such as the Wellcome Trust, the NIH, and the European Commission on the one hand, and by leading journals in the field on the other. Along with this ever more solid basic role in the field, public data will continue to evolve. A major new development in the foreseeable future will undoubtedly be the integration of quality control metrics along with the submitted data.
Indeed, the field has shown an increasing awareness of the importance of quality control over the past few years, with a very strong effort by Rudnick et al. [40] in 2010 as a clear milestone towards much more global quality assessment and assurance. In the same year, a dedicated workshop on quality control in proteomics, funded by the European Science Foundation (ESF) and held in Cambourne, UK [41], delivered several relevant papers the following year [42]. These initial efforts were followed up by several important publications detailing ways in which to automate the gathering of quality control parameters [43], [44], [45], [46], [47], and by perspectives on the importance of establishing robust quality control in the field, notably with an eye to clinical applications [48], [49]. It should be further noted that quality control at the level of the repository [50] and within public data [25], [51] had also been taken up by this time. The final piece of the quality control puzzle has been delivered by the formulation of a generic standard for reporting quality control metrics, in the form of qcML [52], its associated programmatic access libraries [53], and compatible, automated workflows [46]. It is, therefore, only a matter of time before submissions to public repositories will either need to be accompanied by quality control parameters at the time of submission, or will have a standard set of quality control metrics calculated automatically after submission. Public data have clearly come a long way in proteomics, and the current availability of data already provides highly exciting opportunities for re-use. It is noteworthy that the original focus on data validation has thus been superseded by a much more positive outlook: that of the promise of data re-analysis. With ever better metadata annotation, the reach of such re-analyses will, moreover, only become wider.
It can, therefore, be expected that the term in silico proteomics will soon become commonplace, and when this happens, it will be a crucial and highly useful milestone for the field at large.

Conflict of interest

The author declares no conflict of interest.

52 in total

1. Probability-based validation of protein identifications using a modified SEQUEST algorithm.

Authors: Michael J MacCoss; Christine C Wu; John R Yates
Journal: Anal Chem Date: 2002-11-01 Impact factor: 6.986

2. Open mass spectrometry search algorithm.

Authors: Lewis Y Geer; Sanford P Markey; Jeffrey A Kowalak; Lukas Wagner; Ming Xu; Dawn M Maynard; Xiaoyu Yang; Wenyao Shi; Stephen H Bryant
Journal: J Proteome Res Date: 2004 Sep-Oct Impact factor: 4.466

3. LogViewer: a software tool to visualize quality control parameters to optimize proteomics experiments using Orbitrap and LTQ-FT mass spectrometers.

Authors: Michael J Sweredoski; Geoffrey T Smith; Anastasia Kalli; Robert L J Graham; Sonja Hess
Journal: J Biomol Tech Date: 2011-12

4. Using annotated peptide mass spectrum libraries for protein identification.

Authors: R Craig; J C Cortens; D Fenyo; R C Beavis
Journal: J Proteome Res Date: 2006-08 Impact factor: 4.466

Review 5. The minimum information about a proteomics experiment (MIAPE).

Authors: Chris F Taylor; Norman W Paton; Kathryn S Lilley; Pierre-Alain Binz; Randall K Julian; Andrew R Jones; Weimin Zhu; Rolf Apweiler; Ruedi Aebersold; Eric W Deutsch; Michael J Dunn; Albert J R Heck; Alexander Leitner; Marcus Macht; Matthias Mann; Lennart Martens; Thomas A Neubert; Scott D Patterson; Peipei Ping; Sean L Seymour; Puneet Souda; Akira Tsugita; Joel Vandekerckhove; Thomas M Vondriska; Julian P Whitelegge; Marc R Wilkins; Ioannis Xenarios; John R Yates; Henning Hermjakob
Journal: Nat Biotechnol Date: 2007-08 Impact factor: 54.908

6. NCBI Peptidome: a new public repository for mass spectrometry peptide identifications.

Authors: Douglas J Slotta; Tanya Barrett; Ron Edgar
Journal: Nat Biotechnol Date: 2009-07 Impact factor: 54.908

7. SIMPATIQCO: a server-based software suite which facilitates monitoring the time course of LC-MS performance metrics on Orbitrap instruments.

Authors: Peter Pichler; Michael Mazanek; Frederico Dusberger; Lisa Weilnböck; Christian G Huber; Christoph Stingl; Theo M Luider; Werner L Straube; Thomas Köcher; Karl Mechtler
Journal: J Proteome Res Date: 2012-10-22 Impact factor: 4.466

8. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

9. From Peptidome to PRIDE: public proteomics data migration at a large scale.

Authors: Attila Csordas; Rui Wang; Daniel Ríos; Florian Reisinger; Joseph M Foster; Douglas J Slotta; Juan Antonio Vizcaíno; Henning Hermjakob
Journal: Proteomics Date: 2013-04-20 Impact factor: 3.984

10. An update on LNCipedia: a database for annotated human lncRNA sequences.

Authors: Pieter-Jan Volders; Kenneth Verheggen; Gerben Menschaert; Klaas Vandepoele; Lennart Martens; Jo Vandesompele; Pieter Mestdagh
Journal: Nucleic Acids Res Date: 2014-11-05 Impact factor: 16.971