jmzTab: a java interface to the mzTab data standard.

Qing-Wei Xu, Johannes Griss, Rui Wang, Andrew R Jones, Henning Hermjakob, Juan Antonio Vizcaíno.

Abstract

mzTab is the most recent standard format developed by the Proteomics Standards Initiative. mzTab is a flexible tab-delimited file format that can capture identification and quantification results coming from MS-based proteomics and metabolomics approaches. Here we present an open-source Java application programming interface for mzTab called jmzTab. The software allows the efficient processing of mzTab files, providing read and write capabilities, and is designed to be embedded in other software packages. The second key feature of the jmzTab model is that it provides a flexible framework to maintain the logical integrity between the metadata and the table-based sections in the mzTab files. As example implementations, we also describe two stand-alone tools that can be used to validate mzTab files and to convert PRIDE XML files to mzTab. The library is freely available at http://mztab.googlecode.com.

Keywords: Bioinformatics; Data standard; Java application programming interface; Proteomics Standards Initiative

Year: 2014 PMID: 24659499 PMCID: PMC4230411 DOI: 10.1002/pmic.201300560

Source DB: PubMed Journal: Proteomics ISSN: 1615-9853 Impact factor: 3.984

In the last decade, several vendor-neutral standard data formats were developed by the HUPO Proteomics Standards Initiative (PSI, http://www.psidev.info) to promote data sharing and software development in the field. So far, most of the released data standards related to MS-based proteomics are XML-based: mzML (for MS data) [1], mzIdentML (for peptide and protein identifications) [2], mzQuantML (for quantification data) [3], and TraML (for transition lists in targeted proteomics approaches) [4]. During the development of mzIdentML and mzQuantML, the main focus was on providing an accurate and comprehensive representation of the data. This resulted in relatively complex XML schemas, which in some cases can make it difficult for data consumers to access the information. In addition, no standard format existed for MS-based metabolomics results.
To overcome these issues, the mzTab standard (J. Griss et al., paper submitted) was recently developed as a flexible tab-delimited file format to report proteomics and metabolomics results, including both identification and quantification data. At the time of writing, the first stable version (v1.0) of the standard is about to be released (see the updated specification document at http://mztab.googlecode.com). mzTab has a flexible design that allows the reporting of identification and quantification results at different levels, ranging from a simple summary or subset of the complete information (e.g., the final results) up to a fairly comprehensive representation of the results including the experimental design. Many data consumers are only concerned about having access to the final results of a study in an easily accessible format that is compatible with tools like Microsoft Excel® or the R programming language (R Core Team, http://www.R-project.org), among others. For this reason, mzTab also aims to make MS proteomics and metabolomics results available to the wider biological community, beyond the field of MS.
An mzTab file can have up to five different sections: metadata, protein, peptide, psm (peptide spectrum match), and small molecule. In addition, it can reference the corresponding mass spectra in the relevant external files. There are two types of mzTab files: “Identification” (including peptide, protein, and/or small molecule identifications) and “Quantification” (used for quantification results, which may optionally contain identification results as well). In addition, there are two levels of detail (called “mode”) for reporting data: “Summary” and “Complete.” The “Summary” mode can be used to report the final results of a study, for example, reporting data averaged from different replicates. The “Complete” mode is used if detailed experimental information coming from each individual assay/replicate is provided. The experimental design is modeled in a similar way to mzQuantML, including the elements “study_variable”, “assay”, “ms_run”, and “sample” (see the mzTab specification document for more details).
There are already several tools implementing mzTab. For example, mzTab is in use in the OpenMS Proteomics Pipeline [7] and fully supported by the MSnbase R/Bioconductor package [8]. Also, as the first nonproteomics tool, the LipidDataAnalyzer [9] supports the export of mzTab files, including quantitative information extracted from lipidomics MS data. In addition, two prominent data resources, the PRIDE database (for MS-based proteomics data) [10] and MetaboLights (for metabolomics data) [11], make use of the new standard and are planning to use it heavily in the near future.
Here we introduce jmzTab, an open-source Java application programming interface (API) for reading, writing, and validating mzTab files. This API greatly simplifies accessing the information included in mzTab files, thereby promoting its use and facilitating its support in third-party software.
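To make the five-section layout described above more concrete, the following minimal Java sketch groups the lines of an mzTab file by section. It does not use jmzTab. The three-letter line prefixes (MTD, PRH/PRT, PEH/PEP, PSH/PSM, SMH/SML) and the metadata keys in the sample lines are taken from the mzTab 1.0 specification as we understand it, not from this article, and the class and method names (MzTabSectionSketch, sectionOf, split) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; this class is hypothetical and not part of jmzTab. */
final class MzTabSectionSketch {

    /** The five sections an mzTab file can contain. */
    enum Section { METADATA, PROTEIN, PEPTIDE, PSM, SMALL_MOLECULE }

    /** Returns the section a line belongs to, or null for comments, blank lines, and unknown prefixes. */
    static Section sectionOf(String line) {
        int tab = line.indexOf('\t');
        String prefix = (tab < 0) ? line.trim() : line.substring(0, tab);
        switch (prefix) {
            case "MTD": return Section.METADATA;
            case "PRH": case "PRT": return Section.PROTEIN;         // header line / data row
            case "PEH": case "PEP": return Section.PEPTIDE;
            case "PSH": case "PSM": return Section.PSM;
            case "SMH": case "SML": return Section.SMALL_MOLECULE;
            default: return null;
        }
    }

    /** Groups lines by section, keeping the original line order within each section. */
    static Map<Section, List<String>> split(List<String> lines) {
        Map<Section, List<String>> out = new EnumMap<>(Section.class);
        for (String line : lines) {
            Section section = sectionOf(line);
            if (section != null) {
                out.computeIfAbsent(section, k -> new ArrayList<>()).add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample content, for illustration only.
        List<String> lines = Arrays.asList(
            "MTD\tmzTab-version\t1.0.0",
            "MTD\tmzTab-mode\tSummary",
            "MTD\tmzTab-type\tIdentification",
            "PRH\taccession\tdescription",
            "PRT\tP12345\texample protein");
        for (Map.Entry<Section, List<String>> e : split(lines).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue().size() + " line(s)");
        }
    }
}
```

Because the format is line-oriented and each line names its own section, a consumer interested in a single table can skip all other lines without building a full model. That is the kind of lightweight access the format is designed for.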
Analogous Java libraries were previously developed for other PSI standards, such as jmzML [12], jmzIdentML [13], jmzQuantML [14], and jTraML [15]. The API is released under the permissive Apache License 2.0, and the source code is available at http://mztab.googlecode.com. In addition, as example implementations of the library, we describe two stand-alone tools supporting validation and conversion functionality for the format. mzTab example files that can be used to test the API are available at https://code.google.com/p/mztab/wiki/ExampleFiles.
jmzTab is structured in a three-layer architecture: (i) the Core Model Layer, a lightweight independent implementation for maintaining the integrity between the different sections in the file; (ii) the Enhancement Utilities Layer, which provides parsing, validation, and conversion functionality; and (iii) the Standalone Application Layer, which constitutes the centralized graphical user interface (GUI) and command line entry point for the conversion and validation functionality. The main classes of the Core Model are displayed as a UML (Unified Modeling Language) diagram in Fig. 1. Detailed documentation about how to use the API can be found at https://code.google.com/p/mztab/wiki/jmzTab2 and at http://mztab.googlecode.com/svn/jmztab/trunk/docs/index.html.

Figure 1

Simplified UML diagram of the jmzTab API. The classes are structured in three different layers: the Core Model Layer (highlighted in gray), the Enhancement Utilities Layer (dark-gray), and the Standalone Application Layer (white). See main text for more details. The diagram does not include all the classes and methods.

The main principle behind the jmzTab Core Model’s design is to provide an independent, lightweight architecture to simplify the integration of the library into different proteomics/metabolomics software applications. In fact, the model can be integrated into external applications without the need for any other third-party packages. The second key feature of the Core Model is the use of a flexible framework to keep the logical data integrity between the metadata and the table-based sections (including the protein, peptide, psm, and small molecule sections).
In the jmzTab Core Model, the MZTabFile class is the central entry point to manage the internal relationships among the different sections in the file. It contains three key components: (i) Metadata, which is a mandatory meta model that provides the definitions from the dataset included in the file; (ii) MZTabColumnFactory, a factory class that can be used to generate stable MZTabColumn elements and to add different optional columns dynamically (e.g., protein and peptide abundance related columns); together, Metadata and MZTabColumnFactory constitute the framework for the MZTabFile class; and (iii) consistency constraints among the different sections of the model. For example, the MZTabFile class supports the iterative modification of the numbers (1−n) assigned to the elements “study_variable”, “ms_run”, “assay”, and “sample”, and of their concrete locations in the file, maintaining the internal consistency between the metadata section and the optional elements in the table-based sections.
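The consistency constraint described above can be illustrated with a small, self-contained Java sketch. It is not the MZTabFile implementation and does not use jmzTab: it simply shows why a renumbering of an indexed element such as ms_run[1] has to be applied to metadata keys, metadata values, and optional column headers together. The class and method names (IndexedElementSketch, renumber, renumberAll) are hypothetical, and the sample metadata keys, column names, and file path are illustrative values modeled on the mzTab specification as we recall it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; this class is hypothetical and not part of jmzTab. */
final class IndexedElementSketch {

    /**
     * Rewrites every occurrence of element[oldId] to element[newId] in one line.
     * The closing bracket is part of the search text, so ms_run[1] does not match ms_run[10].
     * Simplification: a plain substring search would also match a longer name ending in the
     * same element name, and the caller must make sure newId is not already in use.
     */
    static String renumber(String text, String element, int oldId, int newId) {
        return text.replace(element + "[" + oldId + "]", element + "[" + newId + "]");
    }

    /** Applies the same renumbering to every line, so metadata and table headers stay aligned. */
    static List<String> renumberAll(List<String> lines, String element, int oldId, int newId) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(renumber(line, element, oldId, newId));
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample lines; the file location is a placeholder.
        List<String> lines = Arrays.asList(
            "MTD\tms_run[1]-location\tfile:/data/run_a.mzML",
            "MTD\tassay[1]-ms_run_ref\tms_run[1]",
            "PRH\taccession\tnum_psms_ms_run[1]\tprotein_abundance_assay[1]");
        for (String line : renumberAll(lines, "ms_run", 1, 2)) {
            System.out.println(line);
        }
    }
}
```

A model-based library can perform this update on typed objects rather than on text, which avoids the substring caveats noted in the comments. The text-based version is used here only to keep the example free of dependencies.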
The MZTabFile methods that modify these assigned numbers are particularly useful when information from different experiments (e.g., from different MS runs) is condensed into a single mzTab file.
As mentioned above, in addition to the Core Model, the classes included in the Enhancement Utilities Layer provide mzTab parsing, validation, and conversion related functionality:
Parsing and validation: mzTab files can be validated to ensure that they comply with the latest version of the format specification. The process includes two steps. First, the basic model architecture is created, including the metadata section and the generation of the table column headers. The second step is the validation of the table rows. The class MZTabFileParser is used to parse and validate the mzTab files. If the validation completes, an MZTabFile model is generated. A series of messages is then reported, which can help to diagnose different types of format-related and/or logical errors (the latter being errors in the logical relationships among the different sections in a file). At the time of writing, there are about sixty types of error messages (https://code.google.com/p/mztab/wiki/jmzTab2_message). Each validation message has a unique identifier and is classified into one of three levels (Info, Warn, or Error) according to the requirements from the specification document.
Conversion from PRIDE XML (the internal XML format used by PRIDE) files to mzTab: The library supports the one-to-one conversion from PRIDE XML to mzTab. This functionality is provided since the PRIDE team is starting to make all proteomics results available in this format. In addition, by extending the ConvertProvider class, the current model can be used for the conversion of third-party format files. Converters from mzIdentML and mzQuantML into mzTab are currently being developed using this capability, for example, as part of the mzidLibrary [16].
The objective is that researchers will always be able to access the information in PRIDE through the mzTab format, independently of the format used for the data submission.
As example implementations, two stand-alone tools were developed. Both tools are included in a zipped file, available to download at https://code.google.com/p/mztab/downloads/list. The first, mzTabGUI, is a desktop application that provides mzTab validation functionality and conversion from PRIDE XML files to mzTab in two different tabs (Fig. 2). After the conversion, the tool additionally performs an automatic validation of the mzTab file.

Figure 2

Screenshots of the mzTabGUI tool. (A) Example of validation report. Six error messages are output in the console panel. (B) Example of conversion of one PRIDE XML file to mzTab.

The second tool, mzTabCLI, is a command line interface (CLI) that provides a more flexible way of processing mzTab files in batch mode. It also includes validation and conversion functionality.
In addition, it is important to highlight that jmzTab is already integrated and used in other applications such as the LipidDataAnalyzer (http://genome.tugraz.at/lda/) and in an mzQuantML to mzTab converter included in the mzq-lib library (https://mzq-lib.googlecode.com/). The two stand-alone tools described above also play an important role in parsing, converting, and validating mzTab files in these projects.
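The two-step validation and the leveled messages described earlier for the Enhancement Utilities Layer can be sketched as follows. This is a simplified, self-contained illustration and not the MZTabFileParser implementation. Only the level names (Info, Warn, Error) and the two-step structure come from the text. The class ValidationSketch, its message identifiers, and the two checks it performs are assumptions made for the example, and only the protein table is inspected.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; this class is hypothetical and not part of jmzTab. */
final class ValidationSketch {

    /** The three message levels named in the text. */
    enum Level { INFO, WARN, ERROR }

    /** A validation message; the numeric identifiers used below are made up. */
    static final class Message {
        final int id;
        final Level level;
        final int lineNumber; // 1-based; 0 means the message concerns the whole file
        final String text;

        Message(int id, Level level, int lineNumber, String text) {
            this.id = id;
            this.level = level;
            this.lineNumber = lineNumber;
            this.text = text;
        }

        @Override
        public String toString() {
            return "[" + level + "-" + id + "] line " + lineNumber + ": " + text;
        }
    }

    static List<Message> validate(List<String> lines) {
        List<Message> messages = new ArrayList<>();

        // Step 1: build the skeleton, i.e. find the metadata and the protein column header.
        boolean hasMetadata = false;
        int headerWidth = -1;
        for (String line : lines) {
            String[] cells = line.split("\t", -1);
            if (cells[0].equals("MTD")) hasMetadata = true;
            if (cells[0].equals("PRH")) headerWidth = cells.length;
        }
        if (!hasMetadata) {
            messages.add(new Message(1, Level.ERROR, 0, "metadata section is missing"));
        }

        // Step 2: check every protein row against the header found in step 1.
        for (int i = 0; i < lines.size(); i++) {
            String[] cells = lines.get(i).split("\t", -1);
            if (!cells[0].equals("PRT")) continue;
            if (headerWidth < 0) {
                messages.add(new Message(2, Level.ERROR, i + 1, "protein row without a protein header"));
            } else if (cells.length != headerWidth) {
                messages.add(new Message(3, Level.ERROR, i + 1,
                    "row has " + cells.length + " cells, header has " + headerWidth));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "PRH\taccession\tdescription",
            "PRT\tP12345");
        for (Message m : validate(lines)) {
            System.out.println(m);
        }
    }
}
```

Collecting messages in a list instead of throwing on the first problem lets a caller report every issue in one pass and filter by level, which matches the kind of report shown in the mzTabGUI console panel.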

Conclusions

We have presented the open-source library jmzTab to support the PSI's new mzTab standard format. The API follows the design principles used in other existing analogous APIs for other PSI standards such as jmzML or jmzIdentML. Since jmzIdentML and jmzML are focussed around XML formats, jmzTab had to be developed from scratch. In addition, we present two stand-alone tools, which make use of the API. It is planned that the mzTab format will be heavily used in prominent resources such as PRIDE and MetaboLights. Therefore, it is expected that jmzTab will be one of the essential pieces of software to facilitate data provision and data access to both resources.

14 in total

1. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation.

Authors: Laurent Gatto; Kathryn S Lilley
Journal: Bioinformatics Date: 2011-11-22 Impact factor: 6.937

2. TOPP--the OpenMS proteomics pipeline.

Authors: Oliver Kohlbacher; Knut Reinert; Clemens Gröpl; Eva Lange; Nico Pfeifer; Ole Schulz-Trieglaff; Marc Sturm
Journal: Bioinformatics Date: 2007-01-15 Impact factor: 6.937

3. jmzML, an open-source Java API for mzML, the PSI standard for MS data.

Authors: Richard G Côté; Florian Reisinger; Lennart Martens
Journal: Proteomics Date: 2010-04 Impact factor: 3.984

4. Lipid Data Analyzer: unattended identification and quantitation of lipids in LC-MS data.

Authors: Jürgen Hartler; Martin Trötzmüller; Chandramohan Chitraju; Friedrich Spener; Harald C Köfeler; Gerhard G Thallinger
Journal: Bioinformatics Date: 2010-12-17 Impact factor: 6.937

5. TraML--a standard format for exchange of selected reaction monitoring transition lists.

Authors: Eric W Deutsch; Matthew Chambers; Steffen Neumann; Fredrik Levander; Pierre-Alain Binz; Jim Shofstahl; David S Campbell; Luis Mendoza; David Ovelleiro; Kenny Helsens; Lennart Martens; Ruedi Aebersold; Robert L Moritz; Mi-Youn Brusniak
Journal: Mol Cell Proteomics Date: 2011-12-12 Impact factor: 5.911

6. The mzIdentML data standard for mass spectrometry-based proteomics results.

Authors: Andrew R Jones; Martin Eisenacher; Gerhard Mayer; Oliver Kohlbacher; Jennifer Siepen; Simon J Hubbard; Julian N Selley; Brian C Searle; James Shofstahl; Sean L Seymour; Randall Julian; Pierre-Alain Binz; Eric W Deutsch; Henning Hermjakob; Florian Reisinger; Johannes Griss; Juan Antonio Vizcaíno; Matthew Chambers; Angel Pizarro; David Creasy
Journal: Mol Cell Proteomics Date: 2012-02-27 Impact factor: 5.911

7. jmzIdentML API: A Java interface to the mzIdentML standard for peptide and protein identification data.

Authors: Florian Reisinger; Ritesh Krishna; Fawaz Ghali; Daniel Ríos; Henning Hermjakob; Juan Antonio Vizcaíno; Andrew R Jones
Journal: Proteomics Date: 2012-03 Impact factor: 3.984

8. The PRoteomics IDEntifications (PRIDE) database and associated tools: status in 2013.

Authors: Juan Antonio Vizcaíno; Richard G Côté; Attila Csordas; José A Dianes; Antonio Fabregat; Joseph M Foster; Johannes Griss; Emanuele Alpi; Melih Birim; Javier Contell; Gavin O'Kelly; Andreas Schoenegger; David Ovelleiro; Yasset Pérez-Riverol; Florian Reisinger; Daniel Ríos; Rui Wang; Henning Hermjakob
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

9. The mzQuantML data standard for mass spectrometry-based quantitative studies in proteomics.

Authors: Mathias Walzer; Da Qi; Gerhard Mayer; Julian Uszkoreit; Martin Eisenacher; Timo Sachsenberg; Faviel F Gonzalez-Galarza; Jun Fan; Conrad Bessant; Eric W Deutsch; Florian Reisinger; Juan Antonio Vizcaíno; J Alberto Medina-Aunon; Juan Pablo Albar; Oliver Kohlbacher; Andrew R Jones
Journal: Mol Cell Proteomics Date: 2013-04-18 Impact factor: 5.911

10. Tools (Viewer, Library and Validator) that facilitate use of the peptide and protein identification standard format, termed mzIdentML.

Authors: Fawaz Ghali; Ritesh Krishna; Pieter Lukasse; Salvador Martínez-Bartolomé; Florian Reisinger; Henning Hermjakob; Juan Antonio Vizcaíno; Andrew R Jones
Journal: Mol Cell Proteomics Date: 2013-06-28 Impact factor: 5.911
