ms-data-core-api: an open-source, metadata-oriented library for computational proteomics.

Yasset Perez-Riverol¹, Julian Uszkoreit², Aniel Sanchez³, Tobias Ternent¹, Noemi Del Toro¹, Henning Hermjakob¹, Juan Antonio Vizcaíno¹, Rui Wang¹.

Abstract

The ms-data-core-api is a free, open-source library for developing computational proteomics tools and pipelines. The Application Programming Interface, written in Java, enables rapid tool creation by providing a robust, pluggable programming interface and common data model. The data model is based on controlled vocabularies/ontologies and captures the whole range of data types included in common proteomics experimental workflows, going from spectra to peptide/protein identifications to quantitative results. The library contains readers for three of the most used Proteomics Standards Initiative standard file formats: mzML, mzIdentML, and mzTab. In addition to mzML, it also supports other common mass spectra data formats: dta, ms2, mgf, pkl, apl (text-based), mzXML and mzData (XML-based). Also, it can be used to read PRIDE XML, the original format used by the PRIDE database, one of the world-leading proteomics resources. Finally, we present a set of algorithms and tools whose implementation illustrates the simplicity of developing applications using the library.

AVAILABILITY AND IMPLEMENTATION: The software is freely available at https://github.com/PRIDE-Utilities/ms-data-core-api.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

CONTACT: juan@ebi.ac.uk.

Year: 2015 PMID: 25910694 PMCID: PMC4547611 DOI: 10.1093/bioinformatics/btv250

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The Proteomics Standards Initiative (PSI) has developed and actively promotes the use of open standard data formats to represent the data produced in mass spectrometry (MS) based proteomics experiments (including technical and biological metadata). Three of the most broadly used formats are: mzML (Martens et al.) to capture the ‘primary’ data (the spectra and chromatograms), mzIdentML (Jones et al.) to report peptide identifications as well as the inferred protein identifications, and the tab-delimited mzTab format (Griss et al.) that can represent both identification and quantification results. There is increasing interest in new software tools and libraries that can work with these standards. As a result, a set of software libraries in different programming languages has been created (Perez-Riverol et al.). However, having independent libraries for different formats can complicate the development of new software. Developers typically have to invest considerable time and effort in basic functionality such as converting data structures between formats, which shifts their focus away from the novel aspects of their software.

In parallel, the volume of MS proteomics data available in the public domain keeps growing. This presents immense potential for quality assessment and data reanalysis (Perez-Riverol et al.). A large proportion of the public data is in standard formats, which are heavily promoted by the resources that are part of ProteomeXchange (Vizcaino et al.).

Here we present the ms-data-core-api, an open-source Java Application Programming Interface (API) to efficiently handle the main data types in MS proteomics workflows, ranging from spectra to peptide/protein identifications to quantitative results. The current version supports three major PSI data standards (mzML, mzIdentML and mzTab) and the majority of mass spectra file formats (mzXML, mzData, mgf, pkl, dta, ms2, apl). This makes ms-data-core-api the first open-source API supporting both identification and quantitation PSI file formats. In addition, as a key feature, it fully supports access to data stored in the PRIDE database (Vizcaino et al.) by also supporting the PRIDE XML format, providing access to all the projects available in this older format. We also introduce a rapidly growing set of algorithms and tools whose implementation helps to illustrate the simplicity of developing applications based on ms-data-core-api.

2 Design and implementation

The ms-data-core-api library provides a unified access interface to different proteomics MS-derived data types, independent of the format-specific details (Fig. 1). This interface provides methods to access and retrieve information on metadata, chromatograms, spectra, peptide spectrum matches (PSMs), peptides, proteins, protein modifications including post-translational modifications (PTMs) and quantitative results (Supplementary information, section S1.2). The biggest advantage of using the library is that any application based on it is largely file format agnostic. Following a modular design, many independent libraries were grouped at the same dependency level (Supplementary information, Fig. S1). The data model provides adapters that translate the input data from the different source files into the core data structures, enabling the support of widely used formats (Supplementary information, Table S1). Crucially, the output of some of the most widely used analysis software in the field is either written in these popular formats or can be converted to them, and can therefore be handled by the library (Fig. 1).
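The adapter idea described above can be illustrated with a minimal sketch. It is not the library's code: SimpleSpectrum, SpectrumAccess, MgfLikeAdapter, DtaLikeAdapter and FormatAgnosticDemo are hypothetical names introduced here, and the two parsers handle only a simplified subset of the mgf and dta text layouts. The sketch shows how client code written against one interface stays unchanged when the input format changes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical core data structure shared by all adapters (not the library's class). */
final class SimpleSpectrum {
    final String id;
    final double[] mz;
    final double[] intensity;

    SimpleSpectrum(String id, List<String> peakLines) {
        this.id = id;
        this.mz = new double[peakLines.size()];
        this.intensity = new double[peakLines.size()];
        for (int i = 0; i < peakLines.size(); i++) {
            // Each peak line is "mz intensity", separated by whitespace
            String[] parts = peakLines.get(i).trim().split("\\s+");
            mz[i] = Double.parseDouble(parts[0]);
            intensity[i] = Double.parseDouble(parts[1]);
        }
    }
}

/** Hypothetical unified access interface; real controllers expose far more. */
interface SpectrumAccess {
    List<SimpleSpectrum> getSpectra();
}

/** Adapter for simplified mgf-like text: BEGIN IONS, TITLE=, peak lines, END IONS. */
final class MgfLikeAdapter implements SpectrumAccess {
    private final List<String> lines;

    MgfLikeAdapter(List<String> lines) { this.lines = lines; }

    @Override
    public List<SimpleSpectrum> getSpectra() {
        List<SimpleSpectrum> spectra = new ArrayList<SimpleSpectrum>();
        List<String> peaks = new ArrayList<String>();
        String title = "";
        for (String raw : lines) {
            String line = raw.trim();
            if (line.equals("BEGIN IONS")) {
                peaks = new ArrayList<String>();
                title = "";
            } else if (line.startsWith("TITLE=")) {
                title = line.substring("TITLE=".length());
            } else if (line.equals("END IONS")) {
                spectra.add(new SimpleSpectrum(title, peaks));
            } else if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                peaks.add(line); // other KEY=VALUE headers are ignored in this sketch
            }
        }
        return spectra;
    }
}

/** Adapter for simplified dta-like text: one precursor line, then peak lines. */
final class DtaLikeAdapter implements SpectrumAccess {
    private final String id;
    private final List<String> lines;

    DtaLikeAdapter(String id, List<String> lines) { this.id = id; this.lines = lines; }

    @Override
    public List<SimpleSpectrum> getSpectra() {
        List<String> peaks = new ArrayList<String>();
        for (int i = 1; i < lines.size(); i++) { // line 0 holds precursor mass and charge
            if (!lines.get(i).trim().isEmpty()) peaks.add(lines.get(i));
        }
        List<SimpleSpectrum> spectra = new ArrayList<SimpleSpectrum>();
        spectra.add(new SimpleSpectrum(id, peaks));
        return spectra;
    }
}

final class FormatAgnosticDemo {
    /** Client code: depends only on the interface, never on the file format. */
    static double totalIonCurrent(SpectrumAccess source) {
        double sum = 0.0;
        for (SimpleSpectrum spectrum : source.getSpectra()) {
            for (double value : spectrum.intensity) sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        SpectrumAccess mgf = new MgfLikeAdapter(Arrays.asList(
                "BEGIN IONS", "TITLE=scan1", "100.0 10.0", "200.0 30.0", "END IONS"));
        SpectrumAccess dta = new DtaLikeAdapter("scan2", Arrays.asList(
                "500.25 2", "100.0 5.0", "150.0 7.5"));
        System.out.println(totalIonCurrent(mgf));
        System.out.println(totalIonCurrent(dta));
    }
}
```

Supporting a further format in this sketch would mean adding one more SpectrumAccess implementation; totalIonCurrent and any other client code would not need to change.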

Fig. 1. Overview of the design of the ms-data-core-api

The API is composed of four functional components: (i) the data models incorporating all data structures (chromatograms, spectra, peptide/protein identifications, and quantitative information; Supplementary information, section S1.2.2); (ii) the transformers between the native file data representation and the data model; (iii) the cache system ensuring fast and efficient access to the data (Supplementary information, section S1.2.3); and (iv) the data access controllers that can interface with external tools and libraries.

Metadata-driven design: The data model has three major components: proteomics data, features and properties, and metadata (Fig. 1). The proteomics data comprise the spectrum-related information (intensities and masses), peptide/protein identifications (sequences, protein identifiers) and quantitation information. The proteomics data model also encodes the associated features (e.g. scores and thresholds) as cvParams, which refer to either a controlled vocabulary (CV) or an ontology (e.g. the PSI-MS CV), and userParams, which are user-defined parameters that represent information not yet included in CVs and ontologies. The API also contains general metadata on the experimental set-up. For example, metadata on the protein identification protocol such as software, enzyme and search database are part of the ProteinDetectionProtocol class.

Cache design and PRIDE utilities: The design of the ms-data-core-api aims to achieve an optimal balance between memory consumption and data access performance by using a custom caching implementation (Supplementary information, Fig. S3). Most of the data structures in the API are cached as key-value entries depending on access patterns. Data objects are cached as a whole if they are requested frequently, whereas objects requested less often are cached as mappings to their locations within the source file for fast random access.
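The caching strategy described above can be approximated by a minimal two-tier sketch. It rests on several assumptions and is not the library's implementation (which is documented in Supplementary Fig. S3): the class OffsetBackedCache, its methods and the one-record-per-line "id<TAB>payload" file layout are hypothetical. Every record is indexed by its byte offset in the source file, and only the most recently requested objects are also kept whole in memory.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical two-tier cache; not the cache shipped with ms-data-core-api. */
final class OffsetBackedCache {
    private final File source;
    /** Location tier: record id -> {byte offset, byte length} in the source file. */
    private final Map<String, long[]> offsetIndex = new HashMap<String, long[]>();
    /** Object tier: whole payloads, evicted in least-recently-used order. */
    private final Map<String, String> hotObjects;
    private int fileReads = 0;

    OffsetBackedCache(File source, final int maxHotObjects) {
        this.source = source;
        // accessOrder=true makes the LinkedHashMap iterate from least to most recently used
        this.hotObjects = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxHotObjects;
            }
        };
        // Scan the file once and remember where each record starts and how long it is
        try (RandomAccessFile raf = new RandomAccessFile(source, "r")) {
            long start = 0;
            String line;
            while ((line = raf.readLine()) != null) {
                long end = raf.getFilePointer();
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    offsetIndex.put(line.substring(0, tab), new long[] {start, end - start});
                }
                start = end;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the payload for an id, or null if the id is not in the file. */
    String get(String id) {
        String cached = hotObjects.get(id);
        if (cached != null) return cached; // served from memory
        long[] location = offsetIndex.get(id);
        if (location == null) return null;
        try (RandomAccessFile raf = new RandomAccessFile(source, "r")) {
            byte[] buffer = new byte[(int) location[1]];
            raf.seek(location[0]); // jump straight to the record
            raf.readFully(buffer);
            String record = new String(buffer, StandardCharsets.UTF_8).trim();
            String payload = record.substring(record.indexOf('\t') + 1);
            fileReads++;
            hotObjects.put(id, payload);
            return payload;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of times a record had to be read from the file. */
    int fileReads() { return fileReads; }

    /** Helper for the demo: writes content to a temporary file. */
    static File writeDemoFile(String content) {
        try {
            File file = File.createTempFile("records", ".tsv");
            file.deleteOnExit();
            Files.write(file.toPath(), content.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = writeDemoFile("s1\t100.0 10.0\ns2\t200.0 20.0\ns3\t300.0 30.0\n");
        OffsetBackedCache cache = new OffsetBackedCache(file, 2);
        cache.get("s2"); // read from the file via the stored offset
        cache.get("s2"); // served from the in-memory tier
        System.out.println("file reads: " + cache.fileReads());
    }
}
```

The location tier costs only two numbers per record, so it can cover an entire file, while the size of the object tier bounds how much parsed data is held in memory at once.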
Also, new refinements were introduced in the native file format readers (Supplementary information, Section S1.2.2). The independent PRIDE Utilities module (Fig. 1) provides a cvParam mapper that enables ms-data-core-api to homogenize terms across different file formats. It also includes functions to predict the isoelectric point (Perez-Riverol et al.), monoisotopic mass, and the GRAVY index (Supplementary information, Section S1.2.4). As shown in the code snippet below, calculating these properties for any of the supported formats takes minimal effort:

    // Open an mzIdentML file through its data access controller
    MzIdentMLControllerImpl controller = new MzIdentMLControllerImpl(new File("file.mzid"));
    Collection<Comparable> proteins = controller.getProteinIds();
    for (Comparable id : proteins) {
        for (Comparable pepId : controller.getPeptideIds(id)) {
            // Predict the isoelectric point of each identified peptide
            double pI = IsoelectricPointUtils.calculate(controller.getPeptideSequence(id, pepId));
        }
    }

Exporting to mzTab: The library includes a range of options to export the core data models and the processed results to mzTab files. The current version enables the conversion from mzIdentML and PRIDE XML to mzTab files, including a set of filters to select only the high-quality peptide and protein identifications (Supplementary information, Section S1.2.4). As shown in the code snippet below, converting mzIdentML to mzTab takes minimal effort:

    MzIdentMLControllerImpl controller = new MzIdentMLControllerImpl(new File("input.mzid"));
    // Convert the identifications held by the controller into the mzTab model
    AbstractMzTabConverter mzTabconverter = new MzIdentMLMzTabConverter(controller);
    MZTabFile mzTabFile = mzTabconverter.getMZTabFile();
    // Check the converted file before writing it
    MZTabFileConverter checker = new MZTabFileConverter();
    checker.check(mzTabFile);
    // try-with-resources closes the output stream once the file is written
    try (FileOutputStream out = new FileOutputStream("output.mztab")) {
        mzTabFile.printMZTab(out);
    }
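The filtering step mentioned for the mzTab export can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the library's filter implementation: PsmRow and ThresholdExportSketch are hypothetical names, a higher score is assumed to be better, and the output uses only the PSH/PSM line prefixes of mzTab with a reduced, simplified set of columns, so it is not a valid mzTab file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, reduced view of a peptide spectrum match. */
final class PsmRow {
    final String sequence;
    final String accession;
    final double score;

    PsmRow(String sequence, String accession, double score) {
        this.sequence = sequence;
        this.accession = accession;
        this.score = score;
    }
}

final class ThresholdExportSketch {
    /** Keeps PSMs whose score is at least minScore (assumes higher is better). */
    static List<PsmRow> filterByScore(List<PsmRow> psms, double minScore) {
        List<PsmRow> kept = new ArrayList<PsmRow>();
        for (PsmRow psm : psms) {
            if (psm.score >= minScore) kept.add(psm);
        }
        return kept;
    }

    /** Writes a header line (PSH) and one tab-delimited line per PSM. */
    static String toTabDelimited(List<PsmRow> psms) {
        StringBuilder out = new StringBuilder("PSH\tsequence\taccession\tscore\n");
        for (PsmRow psm : psms) {
            out.append("PSM\t").append(psm.sequence).append('\t')
               .append(psm.accession).append('\t')
               .append(psm.score).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<PsmRow> psms = Arrays.asList(
                new PsmRow("PEPTIDEK", "P12345", 35.2),
                new PsmRow("SAMPLER", "P67890", 4.1));
        // Only identifications passing the threshold reach the exported table
        System.out.print(toTabDelimited(filterByScore(psms, 20.0)));
    }
}
```

Applying the filter before serialization keeps the export code independent of the quality criterion, so a different threshold or score type only changes the filtering step.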

3 Tools and future directions

Various algorithms, tools and pipelines have already been developed using the ms-data-core-api, including PRIDE Inspector, the PRIDE internal submission pipeline, HI-bone and the PIA protein inference algorithm, among others (Perez-Riverol et al.; Vizcaino et al.; Wang et al.; Supplementary information, Section S3). The widespread use of the library ensures its stability, continued development, and community support. The ms-data-core-api library is freely available and released under the Apache 2.0 license at https://github.com/PRIDE-Utilities/ms-data-core-api.

Funding

Y.P-R. is supported by the BBSRC ‘PROCESS’ grant [BB/K01997X/1]; R.W. by the BBSRC ‘Quantitative Proteomics’ grant [BB/I00095X/1]; T.T. by the BBSRC ‘Proteogenomics’ grant [BB/L024225/1]; J.A.V. and N.d.T. by the Wellcome Trust [grant number WT101477MA]; and J.U. by PURE (Protein Unit for Research in Europe), a project of North Rhine-Westphalia, Germany.

Conflict of Interest: none declared.

References

1. Isoelectric point optimization using peptide descriptors and support vector machines.

Authors: Yasset Perez-Riverol; Enrique Audain; Aleli Millan; Yassel Ramos; Aniel Sanchez; Juan Antonio Vizcaíno; Rui Wang; Markus Müller; Yoan J Machado; Lazaro H Betancourt; Luis J González; Gabriel Padrón; Vladimir Besada
Journal: J Proteomics Date: 2012-02-03 Impact factor: 4.044

2. HI-bone: a scoring system for identifying phenylisothiocyanate-derivatized peptides based on precursor mass and high intensity fragment ions.

Authors: Yasset Perez-Riverol; Aniel Sánchez; Jesus Noda; Diogo Borges; Paulo Costa Carvalho; Rui Wang; Juan Antonio Vizcaíno; Lázaro Betancourt; Yassel Ramos; Gabriel Duarte; Fabio C S Nogueira; Luis J González; Gabriel Padrón; David L Tabb; Henning Hermjakob; Gilberto B Domont; Vladimir Besada
Journal: Anal Chem Date: 2013-03-20 Impact factor: 6.986

3. mzML--a community standard for mass spectrometry data.

Authors: Lennart Martens; Matthew Chambers; Marc Sturm; Darren Kessner; Fredrik Levander; Jim Shofstahl; Wilfred H Tang; Andreas Römpp; Steffen Neumann; Angel D Pizarro; Luisa Montecchi-Palazzi; Natalie Tasman; Mike Coleman; Florian Reisinger; Puneet Souda; Henning Hermjakob; Pierre-Alain Binz; Eric W Deutsch
Journal: Mol Cell Proteomics Date: 2010-08-17 Impact factor: 5.911

4. PRIDE Inspector: a tool to visualize and validate MS proteomics data.

Authors: Rui Wang; Antonio Fabregat; Daniel Ríos; David Ovelleiro; Joseph M Foster; Richard G Côté; Johannes Griss; Attila Csordas; Yasset Perez-Riverol; Florian Reisinger; Henning Hermjakob; Lennart Martens; Juan Antonio Vizcaíno
Journal: Nat Biotechnol Date: 2012-02-08 Impact factor: 54.908

5. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

6. Making proteomics data accessible and reusable: current state of proteomics databases and repositories.

Authors: Yasset Perez-Riverol; Emanuele Alpi; Rui Wang; Henning Hermjakob; Juan Antonio Vizcaíno
Journal: Proteomics Date: 2015-03 Impact factor: 3.984

7. The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience.

Authors: Johannes Griss; Andrew R Jones; Timo Sachsenberg; Mathias Walzer; Laurent Gatto; Jürgen Hartler; Gerhard G Thallinger; Reza M Salek; Christoph Steinbeck; Nadin Neuhauser; Jürgen Cox; Steffen Neumann; Jun Fan; Florian Reisinger; Qing-Wei Xu; Noemi Del Toro; Yasset Pérez-Riverol; Fawaz Ghali; Nuno Bandeira; Ioannis Xenarios; Oliver Kohlbacher; Juan Antonio Vizcaíno; Henning Hermjakob
Journal: Mol Cell Proteomics Date: 2014-06-30 Impact factor: 5.911

8. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013.

Authors: Juan Antonio Vizcaíno; Richard G Côté; Attila Csordas; José A Dianes; Antonio Fabregat; Joseph M Foster; Johannes Griss; Emanuele Alpi; Melih Birim; Javier Contell; Gavin O'Kelly; Andreas Schoenegger; David Ovelleiro; Yasset Pérez-Riverol; Florian Reisinger; Daniel Ríos; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

9. Open source libraries and frameworks for mass spectrometry based proteomics: a developer's perspective.

Authors: Yasset Perez-Riverol; Rui Wang; Henning Hermjakob; Markus Müller; Vladimir Besada; Juan Antonio Vizcaíno
Journal: Biochim Biophys Acta Date: 2013-03-01

10. ProteomeXchange provides globally coordinated proteomics data submission and dissemination.

Authors: Juan A Vizcaíno; Eric W Deutsch; Rui Wang; Attila Csordas; Florian Reisinger; Daniel Ríos; José A Dianes; Zhi Sun; Terry Farrah; Nuno Bandeira; Pierre-Alain Binz; Ioannis Xenarios; Martin Eisenacher; Gerhard Mayer; Laurent Gatto; Alex Campos; Robert J Chalkley; Hans-Joachim Kraus; Juan Pablo Albar; Salvador Martinez-Bartolomé; Rolf Apweiler; Gilbert S Omenn; Lennart Martens; Andrew R Jones; Henning Hermjakob
Journal: Nat Biotechnol Date: 2014-03 Impact factor: 54.908
