
mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements.

Ranjeet S Bhamber¹, Andris Jankevics², Eric W Deutsch³, Andrew R Jones⁴, Andrew W Dowsey¹.

Abstract

With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise extensible markup language (XML) representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format "mzMLb" that is optimized for both read/write speed and storage of the raw mass spectrometry data. We provide an extensive validation of the write speed, random read speed, and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb's design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilized by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.


Keywords: HDF5; Proteomics Standards Initiative; data compression; mass spectrometry; metabolomics; mzML; proteomics


Year: 2020 PMID: 32864978 PMCID: PMC7871438 DOI: 10.1021/acs.jproteome.0c00192

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Through an extensive industry-wide collaborative process, in 2008, the Proteomics Standards Initiative (PSI) established a standardized Extensible Markup Language (XML) representation for raw data interchange in mass spectrometry (MS),[1] “mzML,” further building upon concepts defined in the earlier formats mzData and mzXML.[2] mzML is now the pervasive format for interchange and deposition of raw MS proteomics and metabolomics data.[3] However, to provide a detailed, flexible, consistent, and simple standard for the sharing of raw MS data, it was designed around a generic ontology for its representation, at the expense of efficient storage and file access. Two data types are contained within raw MS data sets: (a) numeric data, i.e., mass over charge and spectral/chromatographic intensities; and (b) metadata related to instrument and experimental settings. mzML encodes these data types within a rich, schema-linked XML file, where the metadata is accurately and unambiguously annotated using the PSI-MS controlled vocabulary[4] (CV). However, one of the bottlenecks of mzML’s design is that it is a text-based XML file format and all numeric spectrum data are converted into text strings using Base64 encoding.[5] Optionally, the numeric data can be zlib[6] compressed before encoding, but nevertheless, the sizes of the output files are still 4- to 18-fold higher than the original proprietary vendor format.

A number of technologies[6−8] have been developed by various laboratories to address the inherent performance and practical difficulties of utilizing the mzML format for large-volume, high-throughput analysis of biological samples. The first approach to address the performance and file size issues of mzML was mz5.[6] At the core of mz5 is HDF5[9] (Hierarchical Data Format version 5), originally developed by the National Center for Supercomputing Applications (NCSA) for the storage and organization of large amounts of data. 
HDF5 is a binary format but is similar to XML in the sense that files are self-describing and allow complex data relationships and dependencies. An HDF5 file allows multiple data sets to be stored within it in a hierarchical group structure akin to folders and files on a file system. The two primary objects represented in HDF5 files are “groups” and “data sets.” Groups are container constructs that are used to hold data sets and other groups. Data sets are multidimensional arrays of data elements of a specific type, e.g., integer, floating point, characters, strings, or a collection of these organized as compound types. Both objects support metadata in the form of attributes (key-value pairs) that can be assigned to each object; these attributes can be of any data type. Using groups, data sets, and attributes, complex structures with diverse data types can be efficiently stored and accessed. Each data set can optionally be subdivided into regular “chunks” to enable more efficient data access, as chunks can be loaded and stored in HDF5’s cache implementation for subsequent repeated access. By changing the chunk size parameter, it is possible to adjust HDF5 for different applications, e.g., fast random access where file size does not matter, or larger chunks for an overall smaller compressed file size. Compared with mzML, mz5’s implementation in HDF5 yields an average file size reduction of 54% and increases linear read and write speeds 3–4-fold.[6] However, mz5 involves a complete reimplementation of mzML accomplished through a complex mapping of mzML tags and binary data to compound HDF5 data sets that mimic tables in a relational database. This structure would need to be explicitly altered to accommodate future versions of mzML. The mapping also precludes a Java implementation using the HDF5 Java application programming interface (API) as compound structures are extremely slow to access with this API. 
Moreover, some implementation choices are not supported by the Java API at all, specifically the variable-length nested compound structures mz5 uses to describe scan precursors. The mzDB format[7] uses an alternative database paradigm, the lightweight SQLite relational database. mzDB’s main mechanism of increasing random read performance is in organizing data in small two-dimensional blocks across multiple consecutive spectra (i.e., along both the m/z and retention time axis), enabling a quick reading of XICs. In comparison with XML formats, mzDB saves 25% of storage space when compared to mzML, and data access times are improved by 2-fold or more, depending on the data access pattern. Due to its unique data indexing and accessing scheme, three different software libraries have been created to handle MS data sets, two of which are designed to create and handle MS data-dependent acquisition (DDA), the first “pwiz-mzDB” and the second “mzDB-access”. The third instance named “mzDB-Swath” is specifically designed for the data-independent acquisition (DIA) MS–SWATH technique. In addition, mzDB does not compress the text metadata, which is stored in dedicated “param_tree” fields in XML format with specific XML schema definitions (XSDs). mzDB also stores raw data sets uncompressed, but compression can be achieved through an SQLite extension; however, this extension requires a commercial license for both compression and decompression, and comparative results were not presented in the manuscript. As an alternative, a “compressed fitted” mode is proposed, which uses Gaussian Mixture Models to determine the centroid and left/right half width at half-maximum (LHW/RHW) of each peak for reconstruction. This approach has significantly better sensitivity to low-intensity and overlapping peaks than conventional centroiding, but some errors may result in this process, challenging the use of this approach as a permanent record of the raw data for sharing and archival. 
In summary, mzDB is an excellent format for use as a backend for processing and visualizing mass spectrometry data, as in the authors’ recent Proline software package.[10] As it is not fully integrated into ProteoWizard and there is currently no other mechanism to convert mzDB files back to mzML or other formats, it is currently less suitable as a data sharing and archival format; hence, we do not compare it to mzMLb further in this paper. The imzML data format[11] is predominantly aimed at storing very large mass spectrometry imaging data sets and does so through modest modifications to the mzML format. At the core of this approach is the splitting of XML metadata from the binary encoded data into separate files (*.imzML for the XML metadata and *.ibd for the binary data) and linking them unequivocally using a universally unique identifier (UUID). imzML also introduces new controlled vocabulary (CV) parameters designed specifically to facilitate the use of imaging data. These additional imzML CV parameters, including x/y position, scan direction/pattern, and pixel size, are stored in the *.obo file, following the OBO format 1.2 (which is a text-based format used to describe the CV terms). The approach is designed to enable easier visualization of the data using third-party software. Unlike mz5, mzDB, and imzML, Numpress[8] is an encoding scheme for mzML and not a new or modified file format; its main focus on improving the file size is based on a novel method to compress the binary data in the mzML file before Base64 encoding (note: it does not compress the XML metadata). It accomplishes this by encoding the three common numerical data types present in mzML (mass to charge ratios—m/z, intensities, and retention times) using a variety of heuristics. The first, Numpress Pic (numPic), is intended for ion count data (e.g., from time of flight) and simply rounds the value to the nearest integer for storage in truncation form. 
The second, Numpress Slof (numSlof), is for general-intensity data and involves a log transformation followed by a multiplication by a scaling factor and then conversion and truncation to an integer. This ensures an approximately constant relative error; the authors demonstrate that choosing the threshold to yield a relative error of <2 × 10⁻⁴ did not noticeably affect downstream analysis results. The third approach, Numpress Lin (numLin), is intended specifically for m/z values and uses a fixed-point representation of the value, achieved by multiplying the data by a scaling factor and rounding to the nearest integer. Likewise, a relative error below approximately 2 × 10⁻⁹ was deemed not to unduly affect downstream processing. Taken together, Numpress was shown to reduce mzML file size by around 61%, or approximately 86% if the Numpress spectral output was additionally zlib compressed.

In the proposed mzMLb format, we adopt the HDF5 format[9] also used by mz5, which is well-established for high-volume data applications. However, rather than using a complex and inflexible mapping between mzML and HDF5, we propose a simple hybrid format where the numeric data are stored natively in HDF5 binary while the metadata is preserved as fully PSI standard mzML and linked to the binary in a manner similar to imzML, but stored within the same HDF5 file. Furthermore, we use only core features of HDF5, making our format compatible with NetCDF4[12] readers and writers (including their native Java library). This enables third-party bioinformatics tool developers to import and export data written in mzMLb using libraries already available on a wide variety of platforms and programming languages in a straightforward way. Taking advantage of the inbuilt HDF5 functionality, we also implement a simple predictive coding method that enables efficient lossy compression that results in file sizes comparable to Numpress but is much easier to implement. 
Alternatively, Numpress compressed data can be stored in mzMLb without modification. We provide a reference implementation for mzMLb fully integrated into the popular ProteoWizard toolset, available at https://github.com/biospi/pwiz. We demonstrate the advantages of mzMLb using ProteoWizard’s bidirectional mzML, mz5, and Numpress implementations to provide a fully objective benchmark comparison.
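The three Numpress heuristics summarized in the Introduction can be illustrated with a minimal sketch. This is not the Numpress library or its byte-level encoding: the function names, the `+ 1.0` offset inside the logarithm (added here so that zero intensities are representable), and the scaling factors are all assumptions made for illustration only.

```python
import math


def pic_encode(count: float) -> int:
    """numPic-like: round an ion count to the nearest integer."""
    return round(count)


def slof_encode(intensity: float, scale: float) -> int:
    """numSlof-like: log transform, multiply by a scale, truncate to int.

    The +1.0 offset is an assumption of this sketch (keeps log defined at 0).
    """
    return int(math.log(intensity + 1.0) * scale)


def slof_decode(code: int, scale: float) -> float:
    return math.exp(code / scale) - 1.0


def lin_encode(mz: float, scale: float) -> int:
    """numLin-like: fixed-point value, i.e. scale and round to nearest int."""
    return round(mz * scale)


def lin_decode(code: int, scale: float) -> float:
    return code / scale


# Hypothetical values and scaling factors, chosen only for the demonstration.
intensity, mz = 12345.678, 500.123456789
slof_scale, lin_scale = 1.0e4, 1.0e6

intensity_back = slof_decode(slof_encode(intensity, slof_scale), slof_scale)
mz_back = lin_decode(lin_encode(mz, lin_scale), lin_scale)
print(abs(intensity_back - intensity) / intensity, abs(mz_back - mz))
```

In this sketch the log-scaled code gives a relative error bounded by roughly `1 / scale` regardless of the intensity magnitude, while the fixed-point code gives an absolute error of at most `0.5 / scale`; larger scaling factors trade file size for accuracy.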

Methods

The fundamental design of mzMLb is shown in Figure 1, with the full specification given in the Supporting Information. As illustrated, an mzMLb HDF5 file is composed of data sets for the different data types (numerical and text-based) contained within an mzML file. In this example from our ProteoWizard implementation, the data is stored in four HDF5 data sets: chromatogram start scan times (chromatogram_MS_1000595_double); chromatogram intensities (chromatogram_MS_1000515_float); spectrum m/z’s (spectrum_MS_1000514_double); and spectrum intensities (spectrum_MS_1000515_float). These data sets are accompanied by native HDF5 data sets mirroring the index of the indexed mzML schema (e.g., mzML_chromatogramIndex and mzML_chromatogramIndex_idRef). The design combines the advantages of mzML (XML) and proprietary binary vendor formats while mitigating the drawbacks of each.

Figure 1

mzMLb internal data structure. All data is stored using standard HDF5 constructs, PSI-standard mzML is maintained, and full XML metadata is stored, along with binary data in separate HDF5 data sets. Storage of the chromatogram and spectral data (scan start times, m/z’s, and intensities) is flexible and self-described in terms of floating-point precision and layout, relying simply on the data set name and offset being specified within the <binary> tag for each chromatogram and spectrum in the mzML XML metadata.

The mzML XML metadata is stored inside an HDF5 character array data set “mzMLb.” This is identical to the mzML format except for the following: (a) The binary data is not stored within the <binary> tags; instead, the <binary> tag provides attributes for the name of the HDF5 data set containing the binary data, and the offset within the HDF5 data set where the data is located. This mechanism is also used in imzML and results in valid mzML. (b) If mzML spectrum and chromatogram indices are desired (i.e., an <indexedmzML> block in mzML), they are represented instead by native HDF5 data sets “mzML_spectrumIndex” and “mzML_chromatogramIndex,” which are one-dimensional arrays of 64-bit integers pointing to the start byte of each spectrum/chromatogram in the “mzMLb” data set. In addition, spectrum/chromatogram identifiers, spot ID (an identifier for the spot from which this spectrum was derived, if a matrix-assisted laser desorption ionization (MALDI) or similar run), and scan start time indices can be specified as further HDF5 data sets (see the Supporting Information). All numerical data that is Base64 encoded in mzML (m/z’s, intensities, etc.) is instead stored in mzMLb as native HDF5 data sets, either as floating-point (32-bit or 64-bit) or as a generic byte array if Numpress encoded. As each <binary> tag in the mzMLb data set specifies the name of the data set containing the data, each mzMLb implementation has the freedom to organize the binary data as it wishes. 
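The difference between the two storage styles can be sketched in a few lines. The first pair of functions mimics the mzML route (pack, zlib compress, Base64 encode to text); the second mimics the mzMLb route, with a plain Python dictionary standing in for the HDF5 file. The reference keys `dataset`, `offset`, and `length` are hypothetical names chosen for this sketch and are not taken from the mzMLb specification; only the data set name `spectrum_MS_1000514_double` comes from the text.

```python
import base64
import struct
import zlib

mz_a = [100.0, 200.5, 300.25]   # m/z values of a first spectrum (made up)
mz_b = [150.0, 250.75]          # m/z values of a second spectrum (made up)


def mzml_text_payload(values) -> str:
    """mzML style: pack doubles, zlib-compress, then Base64-encode to text."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return base64.b64encode(zlib.compress(raw)).decode("ascii")


def mzml_text_decode(text: str, count: int):
    raw = zlib.decompress(base64.b64decode(text))
    return list(struct.unpack(f"<{count}d", raw))


# mzMLb style: a dict of lists stands in for native HDF5 data sets.
store = {"spectrum_MS_1000514_double": []}


def append_spectrum(dataset: str, values):
    """Append values to a shared data set; return the reference kept in XML."""
    offset = len(store[dataset])
    store[dataset].extend(values)
    return {"dataset": dataset, "offset": offset, "length": len(values)}


def read_spectrum(ref):
    start = ref["offset"]
    return store[ref["dataset"]][start:start + ref["length"]]


ref_a = append_spectrum("spectrum_MS_1000514_double", mz_a)
ref_b = append_spectrum("spectrum_MS_1000514_double", mz_b)
print(mzml_text_payload(mz_a))
print(ref_b, read_spectrum(ref_b))
```

Because the reference carries both a name and an offset, several spectra can share one data set, which is what lets a real HDF5 backend chunk and compress across spectra.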
Since offsets can be specified, data from multiple spectra can also be colocated within the same HDF5 data set as long as they are of the same data type. This enables mzMLb to harness efficiency gains from HDF5 chunk-based random access and caching schemes and also reduces the file size as data will then be compressed across spectra (which is not possible in mzML). In our ProteoWizard reference implementation of mzMLb, chromatogram and spectrum data are kept apart, but otherwise all data for specific controlled vocabulary parameters (CVParam) are stored in the same data set. For example, in the data set in Figure 1, spectrum intensity values for all spectra are stored in the data set “spectrum_MS_1000515_float.”

We also implemented a simple coding scheme that combines data truncation, a linear prediction method, and use of HDF5’s inbuilt “shuffle” filter to improve the results of a subsequent compression step. The aim of this approach is to exploit the way numerical floating-point data is represented in binary natively on modern computing hardware, resulting in much better compression ratios. The method is lossy but, like Numpress, is designed to add only a very small parts-per-million relative error that does not affect downstream processing. Compared to Numpress, it is much easier for third-party developers to implement, as the encoding and decoding can each be implemented in a single line of code.

To fully appreciate its function and implementation, a basic understanding of how decimal real numbers are represented as binary floating-point numbers is required. A number in double-precision (64-bit) or single-precision (32-bit) binary floating-point[13] format consists of three parts: a sign, an exponent, and a mantissa, as represented in Figure 2. The sign bit represents a negative or positive number if set or unset, respectively (blue binary bit in Figure 2). 
The exponent bits represent the scale of the number and hence specify the location of the radix point within the number (orange binary bits in Figure 2). Finally, the mantissa (green binary bits in Figure 2) expresses the fractional part of the number; the number of bits in the mantissa hence determines the number of significant figures. Having more bits in the exponent (11 bits in double precision compared to 8 bits in single precision) allows a wider range of numbers to be represented, whereas more bits in the mantissa (52 bits in double precision vs 23 bits in single precision) allows more precision. If full precision is not required, then a large number of bits are stored unnecessarily, resulting in unnecessary memory and storage use. This is the case for a significant amount of the numerical data stored in conventional mzML files.
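The three fields described above can be inspected directly. The following sketch uses only the Python standard library to split a double-precision and a single-precision value into sign, biased exponent, and mantissa; the function names are chosen for this illustration.

```python
import struct


def decompose_double(x: float):
    """Return (sign, biased exponent, mantissa) of an IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, bias 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 bits
    return sign, exponent, mantissa


def decompose_single(x: float):
    """Return (sign, biased exponent, mantissa) of an IEEE 754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                    # 1 bit
    exponent = (bits >> 23) & 0xFF       # 8 bits, bias 127
    mantissa = bits & ((1 << 23) - 1)    # 23 bits
    return sign, exponent, mantissa


print(decompose_double(1.0))    # → (0, 1023, 0)
print(decompose_double(-2.0))   # → (1, 1024, 0)
print(decompose_single(1.5))    # → (0, 127, 4194304)
```

A value such as 1.5 needs only the top mantissa bit; every lower bit is zero, which is the kind of redundancy the truncation scheme creates deliberately.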

Figure 2

(a) Visual representation of IEEE 754 double-precision (64-bit) floating-point format and IEEE 754 single-precision (32-bit) floating-point format; zeros are represented by empty boxes and ones are populated. (b) Array of floating-point numbers stored conventionally; yellow bytes can be compressed. (c) Same array truncated and stored using the HDF5 shuffle filter leads to higher compressibility. The pink arrows represent the order in which data is compressed; by reshuffling the order, a higher compression ratio can be achieved.

We exploit this fact by implementing a simple lossy truncation scheme based on reducing the number of mantissa bits used in the floating-point format to represent m/z and intensity values by zeroing insignificant bits, with an example shown in Table 1. Here, we can see that we do not observe an appreciable drop in the parts-per-million accuracy of the decimal number until after we remove 21 bits from the mantissa, and it can be seen how zeroing more and more bits increases the error as we pass the single-precision (23 bits) mantissa level.
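A minimal sketch of mantissa truncation, together with the byte transposition performed by a shuffle filter (Figure 2c), is given below. It is a pure-Python stand-in rather than the HDF5 filter itself or the ProteoWizard code; the synthetic values, the choice of 30 dropped bits, and all function names are assumptions made for illustration.

```python
import struct
import zlib


def truncate_mantissa(x: float, drop_bits: int) -> float:
    """Zero the lowest `drop_bits` bits of the 52-bit mantissa of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = 0xFFFFFFFFFFFFFFFF ^ ((1 << drop_bits) - 1)
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]


def shuffle_bytes(raw: bytes, itemsize: int) -> bytes:
    """Byte transpose: first byte of every item, then every second byte, etc."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def pack_doubles(values) -> bytes:
    return struct.pack(f"<{len(values)}d", *values)


def ppm_error(original: float, approx: float) -> float:
    return abs(approx - original) / abs(original) * 1e6


# Synthetic, hypothetical m/z-like values (not taken from the paper's data).
values = [400.0 + 0.0123456789 * i * i for i in range(5000)]
truncated = [truncate_mantissa(v, 30) for v in values]
worst_ppm = max(ppm_error(v, t) for v, t in zip(values, truncated))

size_plain = len(zlib.compress(pack_doubles(values)))
size_truncated = len(zlib.compress(pack_doubles(truncated)))
size_truncated_shuffled = len(
    zlib.compress(shuffle_bytes(pack_doubles(truncated), 8))
)
print(worst_ppm, size_plain, size_truncated, size_truncated_shuffled)
```

Zeroing `k` of the 52 mantissa bits bounds the relative error by 2^(k − 52), so 30 dropped bits stay below about 0.24 ppm. The shuffle step then gathers the zeroed low-order bytes of all values into long runs, which is what zlib compresses well.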

Table 1

Effect of Changing the Number of Bits Representing the Mantissa in a Floating-Point Number and the Associated Error

The mantissa of a double-precision (64-bit) floating-point number (52 bits in the mantissa) and the mantissa of a single-precision (32-bit) floating-point number (23 bits in the mantissa) both are highlighted in green accordingly.

To translate our truncation approach into improved zlib[14] compression, it is necessary to employ HDF5 byte shuffling. In most formats, floating-point numbers are stored consecutively on disk, so zeroed mantissa bits appear in short bursts, as shown in Figure 2b. The HDF5 shuffle filter rearranges the byte ordering of the data so that it is stored transversely rather than longitudinally, as shown in Figure 2c. This leads to large numbers of consecutive zeros that can be compressed extremely well.

Moreover, further gains are possible by transforming the data so that consecutive values or sets of values are identical, as zlib is designed to compress away repeated patterns. Toward this goal, the mz5 format uses a “delta” prediction scheme that stores the difference between consecutive data points, rather than the data points themselves. This results in floating-point bit patterns (Figure 2) that are less likely to change between consecutive data points and hence are more likely to be compressed. We present an improved technique termed “mzLinear” that extends this approach to a linear extrapolation predicting each data point from the two previous data points, with only the error between the prediction and the actual value stored. As there is often a quadratic relationship across m/z values (for example, since there is a quadratic relationship between time-of-flight and m/z for a standard time-of-flight analyzer), the aim of mzLinear is to result in an approximately constant prediction error across the m/z range, which will compress extremely well. In comparison, delta prediction on quadratic data would lead to prediction errors that rise linearly with m/z. 
The technique and the equation used to calculate the stored error Δh are depicted in Figure 3, with the plot showing a numerical series of m/z values exhibiting a quadratic relationship and how the prediction error Δh remains constant for each value.
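The contrast between delta prediction and a two-point linear predictor can be checked on a small quadratic series. The sketch below is an illustration of the idea rather than the mzMLb implementation: it keeps the first two values verbatim (an assumption of this sketch) and uses integers so that the arithmetic is exact. For a series y_i = a·i² + b·i + c, the linear-extrapolation residual y_i − (2·y_{i−1} − y_{i−2}) equals the constant 2a, whereas the delta y_i − y_{i−1} grows linearly with i.

```python
def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    return [values[0]] + [
        values[i] - values[i - 1] for i in range(1, len(values))
    ]


def linear_encode(values):
    """Keep the first two values, then store linear-extrapolation errors."""
    out = list(values[:2])
    for i in range(2, len(values)):
        predicted = 2 * values[i - 1] - values[i - 2]  # line through 2 points
        out.append(values[i] - predicted)
    return out


def linear_decode(residuals):
    """Invert linear_encode by re-running the same predictor."""
    values = list(residuals[:2])
    for i in range(2, len(residuals)):
        values.append(residuals[i] + 2 * values[i - 1] - values[i - 2])
    return values


# Hypothetical quadratic series y = 3*i**2 + 5*i + 100.
series = [3 * i * i + 5 * i + 100 for i in range(10)]

print(delta_encode(series)[1:])   # → [8, 14, 20, 26, 32, 38, 44, 50, 56]
print(linear_encode(series)[2:])  # → [6, 6, 6, 6, 6, 6, 6, 6]
```

A run of identical residuals is close to the best case for a dictionary compressor such as zlib, whereas a steadily rising sequence of deltas offers far fewer repeated byte patterns.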

Figure 3

mzLinear; linear predictor implemented in mzMLb, where m/z = y and the index i = x, both h0 = 0 and h1 = 0 as the first value are stored in the new array and a linear equation can always be derived to intersect the first two points. However, for the rest of the data points, h, where n = 0, 1, 2, ..., N – 2, is calculated by a linear predictor equation based on the previous two points and N is the total number of m/z values.

To demonstrate mzMLb across a broad spectrum of proteomics and metabolomics data sets used in different laboratories, we selected a wide variety of MS techniques and instruments from varying vendors. The data sets are listed in the Supporting Information, Table S1; files 1–4 are from ref (8), files 5–8 are from ref (15), 9 is from ref (16), 10 is from ref (17), 11 is from ref (18), and finally 12 is from ref (19). We tested mzMLb across different MS types, including SWATH-DIA, DDA, and selected reaction monitoring (SRM) data, and from the major vendors including Thermo, Agilent, Sciex, and Waters. Our implementation of mzMLb has been integrated into the open-source cross-platform ProteoWizard software libraries and tools and is available from https://github.com/biospi/pwiz. Hence, the proprietary raw vendor files can be directly converted into mzMLb using the “msconvert” tool.

Results

We first analyze the performance and generalizability of our truncated mzLinear coding method for m/z accuracy. Figure 4 shows the effects of changing the number of mantissa bits on the “AgilentQToF” data set; it can be seen that increasing truncation decreases the file size while having minimal effect on accuracy. The mzLinear prediction clearly improves the zlib compression across the range of possible truncations, as it is able to exploit the quadratic nature of the m/z to time-of-flight relationship.

Figure 4

mzMLb; mantissa truncation of the AgilentQToF data file, with truncation error and file size for both mzLinear enabled and disabled.

The procedure was performed on all data sets tested, and the mantissa values were chosen such that the error induced by truncation would be less than or comparable to Numpress’ default values of <2 × 10⁻⁴ and <2 × 10⁻⁹ relative errors for the intensities and m/z values, respectively, which according to ref 8 are small enough to have no effect on the downstream results of a given workflow. The resulting relative errors can be seen in the Supporting Information, Table S2, where mzMLb produces higher compression ratios and hence smaller file sizes. Since mzMLb’s truncation relative error is always less than that of Numpress, the validation that Numpress does not noticeably affect downstream processing[8] also applies to mzMLb.

Moreover, we expand this validation by compressing the AgilentQToF and QExactive data files (shown in Table S1) and processing these files through a Mascot peptide search and protein inference workflow, the results of which can be seen in Figure 5. For the AgilentQToF data, we found that Numpress was unable to produce exactly the same peptide and protein lists as the original uncompressed data file. However, mzMLb was able to produce the same search and inference results as the original for both the peptide and protein lists. Here, we can see the relationship between the peptide E-values of both mzMLb and Numpress and those of the original data set: mzMLb gives an injective mapping (a straight line) vs the original peptide E-value, whereas the Numpress results are unable to produce the same injective relationship. 
The discontinuity of the Numpress results can be further illustrated by observing the peptides in the shaded regions of Figure 5c; the peptides highlighted in the enclosed box α are those that were present in both the original file and mzMLb but were not found with Numpress, whereas the peptides enclosed in the β region were found in the Numpress results but were present in neither the original nor the mzMLb results. The number of peptides deviating from the original file can be seen in Figure 5a; here, we see that Numpress did not perform quite as well as mzMLb, as there are a small number of peptide score values deviating from the original data set. In Figure 5b,d, we can see the results of the same procedure on the QExactive file; here, mzMLb again produces an injective relationship with the original data set, i.e., it produces the same results as the unmodified data set. Numpress in Figure 5b performs much better, with an extremely low number of peptides showing ΔS deviation. However, Numpress is still unable to produce exactly the same results as the original in the Mascot pipeline. It does, however, perform better than in the AgilentQToF case, which demonstrates that the lossy compression method employed in Numpress is more sensitive to differences between vendor data files, whereas the mzMLb truncation scheme is more robust to vendor variation and is able to reproduce the same results as an unmodified data file. We thus take the most conservative truncation values from Table S2 (AgilentQToF truncation values) and apply them as the mzMLb defaults.

Figure 5

Mascot peptide PSM search results of the original data set against both Numpress and mzMLb compression for the AgilentQToF and QExactive data files. The top two plots show the number of peptides found in Numpress and mzMLb against the original data, with the x-axis representing the deviation (ΔS) of the peptide score from the original. (a) For the Agilent file, we can clearly see a number of peptide scores deviating from the original score for the Numpress case. (b) Results for the QExactive file, where the number of peptides deviating in the Numpress case is much smaller relative to the number of matching peptides. In both cases, mzMLb outperforms Numpress and has virtually no peptides deviating from the original. The bottom two plots show the relative E-value performance of both mzMLb and Numpress against the original data set, with (c) depicting the results for the Agilent data file and (d) for the QExactive data file.

The HDF5 binary data set chunk size can have a significant impact on access speed and file size. For the AgilentQToF file, Figure 6 compares mzML with zlib, mzML with Numpress + zlib, and mz5 with zlib, to mzMLb across a range of chunk sizes. Figure 6a demonstrates write performance on a Linux workstation (Dell T5810, Intel Xeon CPU E5-1650 v3, with 32 GB RAM and 3 TB HDD, running Ubuntu Linux v18.04). To produce these results, we ran ProteoWizard msconvert 10 times, converting the files from vendor format while recording the write duration. However, modern operating systems including the Linux kernel employ a sophisticated file and memory caching system; to prevent this mechanism from accelerating the multiple writes and reads of the data files being tested, we cleared the Linux memory cache after every invocation of msconvert. It can be seen that for lower chunk sizes, mzMLb (with mzLinear both on and off) outperforms the other formats and only starts to slow at a chunk size of around 512–1024 kb. 
Figure 6b shows the relative compression of the files as the chunk size increases (again for both mzLinear on/off). It can be seen that the compression benefit of increasing the chunk size plateaus at around 1024 kb, while the write speed deteriorates. With mzLinear, the mzMLb file size is 77% better than mzML + zlib, 25% smaller than mzMLb + Numpress + zlib, and represents a 56% increase in compression compared with mz5 + zlib.
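The trade-off between chunk size, compressed size, and random access cost can be reproduced in miniature without HDF5. In the sketch below, each chunk is compressed independently with zlib, so reading a single value requires decompressing the whole chunk that contains it. The data are synthetic and the chunk sizes are arbitrary choices for the demonstration; this is an analogy to a chunked store, not a benchmark of mzMLb.

```python
import struct
import zlib

ITEMSIZE = 8  # bytes per double


def compress_in_chunks(raw: bytes, chunk_bytes: int):
    """Compress each fixed-size chunk independently."""
    return [
        zlib.compress(raw[i:i + chunk_bytes])
        for i in range(0, len(raw), chunk_bytes)
    ]


def read_value(chunks, chunk_bytes: int, index: int):
    """Read one double; the whole chunk holding it must be decompressed."""
    byte_pos = index * ITEMSIZE
    chunk = zlib.decompress(chunks[byte_pos // chunk_bytes])
    start = byte_pos % chunk_bytes
    value = struct.unpack("<d", chunk[start:start + ITEMSIZE])[0]
    return value, len(chunk)  # len(chunk) = bytes decompressed for this read


# Synthetic repeating signal (hypothetical, for illustration only).
values = [float(i % 1000) for i in range(100_000)]
raw = struct.pack(f"<{len(values)}d", *values)

total_size = {}
for kib in (1, 16, 256):
    chunks = compress_in_chunks(raw, kib * 1024)
    total_size[kib] = sum(len(c) for c in chunks)
    value, decompressed = read_value(chunks, kib * 1024, 12345)
    print(kib, total_size[kib], value, decompressed)
```

Larger chunks give the compressor more context and fewer per-chunk headers, so the total size falls, but every random read then pays for decompressing more bytes; the chunk sizes must be multiples of the item size so that no value straddles two chunks.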

Figure 6

Chunk size optimization with mzLinear enabled; (a) mzMLb write benchmark times with varying chunk sizes, (b) file size with and without mzLinear enabled with varying sizes of HDF5 chunking, (c) random read benchmarks for singular spectrum access for full chunking size range, and (d) random read benchmarks for sequential block spectrum access for full chunking size range, with the default chunking size of 1024 highlighted in red.

To evaluate the read performance of mzMLb, we created a C++ program, readBench, which utilizes the ProteoWizard API and its libraries to ensure that all file formats are read consistently under the same software implementation. This command line tool is available from https://github.com/biospi/pwiz. Two scenarios were considered. The first accessed a single spectrum from the data set 10 000 times at random. The second involved reading 10 sequential spectra from a starting point selected 1000 times at random, thus also giving 10 000 total spectrum accesses. Each scenario was performed 10 times for each data point. The results are depicted in Figure 6c,d; in both cases, mzMLb outperforms the other file formats while maintaining a smaller file size. Beyond a 1024 kb chunk size, the random read time drastically increases.

Subsequently, we ran the random read benchmarks again but this time without zlib compression to evaluate use cases where fast access times are paramount and file size is not important. In this test, we include mzMLb in both a lossless and lossy scenario. Here, we introduced the HDF5 BLOSC (http://blosc.org/) plugin to the validation. The aim of BLOSC is to perform modest but extremely fast decompression/compression so that the resulting read/write times are faster than using no compression at all, as less data needs to be physically written to disk. 
It accomplishes this by: utilizing a blocking technique that reduces activity on the system memory bus; transmitting data to the CPU cache faster than the traditional, noncompressed, direct memory fetch approach via a memcpy call; and leveraging SIMD instructions (SSE2 and AVX2 for modern CPUs) and multithreading capabilities present in multicore processors. BLOSC has a number of different optimized compression techniques including BloscLZ, LZ4, LZ4HC, Snappy, Zlib, and Zstd. Throughout these tests, we used BLOSC with LZ4HC compression, as we found it to be the most effective in terms of read and write speeds when dealing with MS data sets.

In Figure 7, we depict the results of our high-throughput tests (utilizing the same data set as in Figure 6), designed to seek out the optimum solution for the fastest access to MS data, in two categories: lossless file formats (Figure 7a,b) and lossy file formats (Figure 7c,d). In both cases, we consider both random single-spectra access (Figure 7a,c) and random block-sequential access (Figure 7b,d). We can see that in the case of lossless compression (Figure 7a,b) mzMLb performs better than both mzML and mz5 in both single and block-sequential data access. Moreover, when we utilize mzMLb with BLOSC LZ4HC compression, we can see that it significantly outperforms both mzML and mz5 at virtually all chunking sizes and performs particularly well at around 1024 kb chunks for both single and block-sequential scans. In the case of lossy data sets (Figure 7c,d), we can see that Numpress has a significant positive impact on random read times for both single and block-sequential data access. Notably, Numpress when coupled with a chunking size in the vicinity of the optimal value performs better when contained within mzMLb rather than mzML. 
Nevertheless, when we utilize both mzMLb with mzLinear and BLOSC LZ4HC compression, we observe that mzMLb is significantly faster than Numpress in block-sequential data access and is comparable to Numpress within mzMLb for random single-spectra access.
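The two access scenarios used in the read benchmarks can be sketched as follows. This is a minimal illustration in Python rather than the C++ readBench tool itself; the function names, the seed handling, and the `read_spectrum` callback are hypothetical and stand in for whatever reader is being timed.

```python
import random
import time


def single_access_indices(n_spectra, n_reads=10_000, seed=0):
    """Scenario 1: n_reads spectrum indices, each drawn uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(n_spectra) for _ in range(n_reads)]


def block_access_indices(n_spectra, n_blocks=1_000, block_len=10, seed=0):
    """Scenario 2: n_blocks random start positions, each followed by
    block_len consecutive spectra (n_blocks * block_len reads in total)."""
    rng = random.Random(seed)
    indices = []
    for _ in range(n_blocks):
        # Keep the whole block inside the valid index range.
        start = rng.randrange(n_spectra - block_len + 1)
        indices.extend(range(start, start + block_len))
    return indices


def time_reads(read_spectrum, indices):
    """Return the wall-clock seconds spent calling read_spectrum on each index.

    read_spectrum is a hypothetical callable supplied by the caller,
    e.g. a wrapper around a file reader for the format under test."""
    start = time.perf_counter()
    for index in indices:
        read_spectrum(index)
    return time.perf_counter() - start
```

Both scenarios issue the same number of reads, so any difference in timing reflects how well a format serves neighbouring spectra, for example when several of them fall into one chunk that has already been read and decompressed.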

Figure 7

mzMLb random read benchmarks, both single-spectrum and block-sequential, for uncompressed data with and without truncation and Numpress enabled: (a) lossless single-spectrum access, (b) lossless block-sequential access, (c) lossy single-spectrum access, and (d) lossy block-sequential access, with the default chunking size of 1024 highlighted in red.

In Figures and S1 (including different types of instruments), we compare the file size and write performance of our new mzMLb file format against the vendor raw file, mzML, mz5, and Numpress within both mzML and mzMLb. All results are the average of 10 runs. Here, we also used the optimum mzMLb chunking size of 1024 kb derived from both Figures and 6, which gives mzMLb both a significant compression ratio and increased performance in reading and writing the mass spectrometry file. In Figure , the colors of the markers represent the different file formats: red represents mz5 files, gray mzML files, blue mzMLb files, and orange mzMLb files with BLOSC. The shape of the markers represents the different filters applied during the conversion process, e.g., a solid triangle represents data sets without compression, a solid diamond data sets with zlib applied, and a yellow asterisk data sets with mzLinear, truncation, and zlib applied. From these results, it can be seen that in all cases, the resulting mzMLb files were significantly smaller than mzML and of a similar size to the vendor raw file. Moreover, Figures and S1 show that mzMLb can easily be tailored to different use cases to maximize the desired performance metric, e.g., maximum compression for archiving, or lower compression but faster access times for processing; see the yellow asterisk and the solid circle markers (Figure ), representing mzLinear + trunc + zlib and Numpress + zlib, respectively.

Figure 8

Summary data showing file sizes for all data sets using the three formats: mzML, mz5, and mzMLb with six different compression combinations spanning both lossless and lossy configurations. Uncompressed data files are also depicted here along with mzMLb BLOSC, demonstrating that fast read access times can be achieved without sacrificing the file size. The original vendor file sizes are represented by the horizontal dashed line.

We also demonstrate the ability of mzMLb to seamlessly integrate new mzML features without any implementation changes. The PSI is currently developing a set of recommendations for encoding data-independent acquisition and ion mobility data in mzML, which include merging the set of ion mobility spectra at each retention time to substantially reduce the repetition of metadata. ProteoWizard already implements this feature through the msconvert switch "--combineIonMobilitySpectra". We demonstrate mzML and mzMLb conversion with and without this switch using an extremely large data set acquired using the recent Bruker diaPASEF technique;[20] the results are shown in Figure S2. Here, we see that even without combining ion mobility spectra, mzMLb is more effective at storing this type of data because it compresses the metadata and compresses the numerical data for multiple spectra in the same chunk. When the switch is used, the file size decreases further.
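The benefit of compressing several spectra within one chunk can be illustrated with a small sketch. It is not the mzMLb implementation: it uses Python's standard zlib module directly rather than HDF5 chunk filters, and it makes the idealized assumption that every spectrum shares exactly the same m/z grid, which is the most favourable case. All names and sizes below are illustrative.

```python
import struct
import zlib


def pack_doubles(values):
    """Serialize a list of floats as little-endian 64-bit doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def per_spectrum_size(spectra, level=6):
    """Total size when each spectrum is compressed on its own,
    as happens when every binary array is encoded separately."""
    return sum(len(zlib.compress(pack_doubles(s), level)) for s in spectra)


def chunked_size(spectra, level=6):
    """Size when all spectra are concatenated and compressed as one chunk,
    so the compressor can reuse patterns seen in earlier spectra."""
    blob = b"".join(pack_doubles(s) for s in spectra)
    return len(zlib.compress(blob, level))


# Assumption for illustration: 20 spectra sampled on one shared m/z grid.
mz_grid = [400.0 + 0.01 * i for i in range(500)]
spectra = [mz_grid for _ in range(20)]

separate = per_spectrum_size(spectra)
together = chunked_size(spectra)
print(f"compressed one by one: {separate} bytes")
print(f"compressed as one chunk: {together} bytes")
```

Each packed spectrum here is 4000 bytes, well inside zlib's 32 kB window, so when the spectra sit in the same chunk the repeated grid is encoded as back-references instead of being compressed from scratch every time. Real data share less structure than this, so the gain will be smaller.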

Conclusions

We demonstrate that, by using a hybrid file format that stores XML metadata together with native binary data within an HDF5 file, it is possible to improve the reading/writing speed of raw MS data while preserving all related metadata in PSI-compliant mzML in an implicitly future-proof way. The mzMLb file format can be tailored for different applications by changing the chunk size parameter, i.e., it is possible to adjust the format for fast access where file size does not matter, e.g., visualization and processing, or for a smaller compressed file size with slower reading/writing times for data archival. As a chunk can contain data from more than one spectrum, compression can occur across spectra, which is not possible in mzML. Our ProteoWizard implementation allows mzMLb parameters to be set for the specific needs of each researcher. We have derived and validated a conservative default value for truncation that does not affect downstream analyses and show that a chunk size of 1024 kb is a good compromise for most applications, providing competitive results across a wide array of data sets. Given the wide range of use cases for mass spectrometry and the broad variety of instrumentation, an interesting future development would be the automatic optimization of mzMLb parameters for each new data set. For example, we would expect the optimal truncation to depend on the mass accuracy of the instrument, while the density of the peak pattern in a spectrum would affect the optimal chunk size for Orbitrap instruments, but perhaps not for time-of-flight instruments, as these tend to also record background counts outside of the peak areas. As mzMLb utilizes HDF5, we are able to leverage transparent mechanisms for random data access, caching, partial reading or writing, and error checksums, and the format is easily extendable through plugins to support additional filters and compression algorithms.
HDF5 also allows the user to add extra information to the data file while still maintaining PSI compatibility, simply by adding extra HDF5 groups and data sets. This allows the user to store other data within the file side by side with the mzMLb data, for example, a version of the data optimized for fast visualization[21] or a blocked layout like mzDB optimized for fast extraction of XICs. The design principles in mzMLb could be used to create a performant HDF5 implementation of PSI's in-progress mzSpecLib format for spectral libraries[22] (http://psidev.info/mzSpecLib). The existing PSI standard formats mzIdentML, mzQuantML, and mzTab for identification and quantification results could also be trivially encapsulated in HDF5, although optimum compression of numerical data would require an extended mapping, as these formats do not utilize Base64-encoded data constructs. As we use only the standard features of HDF5, mzMLb is also bit-for-bit compatible with NetCDF4 (which has native Java libraries). This enables it to be easily implemented by third-party processing software, as both HDF5 and NetCDF4 are widely supported across common programming languages including Java. As of v4.5.0, NetCDF also supports random access to mzMLb files remotely over the internet (the HDF5 Group has recently delivered its own implementation of this functionality), opening up the potential for public repositories to provide new tools for users to efficiently query and visualize their raw data archives.
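The expected link between truncation and instrument mass accuracy can be made concrete with a short sketch. It assumes that truncation means zeroing the low-order mantissa bits of each 64-bit value; the function names, the synthetic m/z values, and the choice of 20 retained bits are illustrative and are not taken from the ProteoWizard implementation.

```python
import math
import struct
import zlib

MANTISSA_BITS = 52  # explicit mantissa bits in an IEEE 754 float64


def truncate_mantissa(value, keep_bits):
    """Zero the lowest (52 - keep_bits) mantissa bits of a float64.

    For normal values the relative error stays below 2 ** -keep_bits."""
    if not 0 <= keep_bits <= MANTISSA_BITS:
        raise ValueError("keep_bits must be between 0 and 52")
    drop = MANTISSA_BITS - keep_bits
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (truncated,) = struct.unpack("<d", struct.pack("<Q", bits))
    return truncated


def compressed_size(values):
    """zlib-compressed size of the values stored as little-endian doubles."""
    return len(zlib.compress(struct.pack(f"<{len(values)}d", *values)))


# Synthetic m/z values, for illustration only.
mz = [400.0 + 0.0137 * i + 1e-4 * math.sin(i) for i in range(2000)]
mz_lossy = [truncate_mantissa(v, keep_bits=20) for v in mz]

worst_rel_error = max(abs(a - b) / a for a, b in zip(mz, mz_lossy))
print(f"worst relative error: {worst_rel_error * 1e6:.3f} ppm")
print(f"lossless: {compressed_size(mz)} bytes, "
      f"truncated: {compressed_size(mz_lossy)} bytes")
```

Keeping 20 mantissa bits bounds the relative error by 2^-20, a little under 1 ppm, so the number of bits that can safely be dropped follows from the mass accuracy the instrument actually delivers. The zeroed low-order bytes are what make the truncated array more compressible.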

16 in total

1. mz5: space- and time-efficient storage of mass spectrometry data sets.

Authors: Mathias Wilhelm; Marc Kirchner; Judith A J Steen; Hanno Steen
Journal: Mol Cell Proteomics Date: 2011-09-29 Impact factor: 5.911

2. imzML--a common data format for the flexible exchange and processing of mass spectrometry imaging data.

Authors: Thorsten Schramm; Zoë Hester; Ivo Klinkert; Jean-Pierre Both; Ron M A Heeren; Alain Brunelle; Olivier Laprévote; Nicolas Desbenoit; Marie-France Robbe; Markus Stoeckli; Bernhard Spengler; Andreas Römpp
Journal: J Proteomics Date: 2012-07-26 Impact factor: 4.044

3. The HUPO proteomics standards initiative- mass spectrometry controlled vocabulary.

Authors: Gerhard Mayer; Luisa Montecchi-Palazzi; David Ovelleiro; Andrew R Jones; Pierre-Alain Binz; Eric W Deutsch; Matthew Chambers; Marius Kallhardt; Fredrik Levander; James Shofstahl; Sandra Orchard; Juan Antonio Vizcaíno; Henning Hermjakob; Christian Stephan; Helmut E Meyer; Martin Eisenacher
Journal: Database (Oxford) Date: 2013-03-12 Impact factor: 3.451

4. Expanding the Use of Spectral Libraries in Proteomics.

Authors: Eric W Deutsch; Yasset Perez-Riverol; Robert J Chalkley; Mathias Wilhelm; Stephen Tate; Timo Sachsenberg; Mathias Walzer; Lukas Käll; Bernard Delanghe; Sebastian Böcker; Emma L Schymanski; Paul Wilmes; Viktoria Dorfer; Bernhard Kuster; Pieter-Jan Volders; Nico Jehmlich; Johannes P C Vissers; Dennis W Wolan; Ana Y Wang; Luis Mendoza; Jim Shofstahl; Andrew W Dowsey; Johannes Griss; Reza M Salek; Steffen Neumann; Pierre-Alain Binz; Henry Lam; Juan Antonio Vizcaíno; Nuno Bandeira; Hannes Röst
Journal: J Proteome Res Date: 2018-10-11 Impact factor: 4.466

5. Streaming visualisation of quantitative mass spectrometry data based on a novel raw signal decomposition method.

Authors: Yan Zhang; Ranjeet Bhamber; Isabel Riba-Garcia; Hanqing Liao; Richard D Unwin; Andrew W Dowsey
Journal: Proteomics Date: 2015-03-09 Impact factor: 3.984

6. Wavelet-based peak detection and a new charge inference procedure for MS/MS implemented in ProteoWizard's msConvert.

Authors: William R French; Lisa J Zimmerman; Birgit Schilling; Bradford W Gibson; Christine A Miller; R Reid Townsend; Stacy D Sherrod; Cody R Goodwin; John A McLean; David L Tabb
Journal: J Proteome Res Date: 2014-12-02 Impact factor: 4.466

7. SPE-IMS-MS: An automated platform for sub-sixty second surveillance of endogenous metabolites and xenobiotics in biofluids.

Authors: Xing Zhang; Michelle Romm; Xueyun Zheng; Erika M Zink; Young-Mo Kim; Kristin E Burnum-Johnson; Daniel J Orton; Alex Apffel; Yehia M Ibrahim; Matthew E Monroe; Ronald J Moore; Jordan N Smith; Jian Ma; Ryan S Renslow; Dennis G Thomas; Anne E Blackwell; Glenn Swinford; John Sausen; Ruwan T Kurulugama; Nathan Eno; Ed Darland; George Stafford; John Fjeldsted; Thomas O Metz; Justin G Teeguarden; Richard D Smith; Erin S Baker
Journal: Clin Mass Spectrom Date: 2016-12-29

8. mzDB: a file format using multiple indexing strategies for the efficient analysis of large LC-MS/MS and SWATH-MS data sets.

Authors: David Bouyssié; Marc Dubois; Sara Nasso; Anne Gonzalez de Peredo; Odile Burlet-Schiltz; Ruedi Aebersold; Bernard Monsarrat
Journal: Mol Cell Proteomics Date: 2014-12-11 Impact factor: 5.911

9. Stable isotope-assisted metabolomics for network-wide metabolic pathway elucidation.

Authors: Darren J Creek; Achuthanunni Chokkathukalam; Andris Jankevics; Karl E V Burgess; Rainer Breitling; Michael P Barrett
Journal: Anal Chem Date: 2012-09-25 Impact factor: 6.986

10. Towards Improving Point-of-Care Diagnosis of Non-malaria Febrile Illness: A Metabolomics Approach.

Authors: Saskia Decuypere; Jessica Maltha; Stijn Deborggraeve; Nicholas J W Rattray; Guiraud Issa; Kaboré Bérenger; Palpouguini Lompo; Marc C Tahita; Thusitha Ruspasinghe; Malcolm McConville; Royston Goodacre; Halidou Tinto; Jan Jacobs; Jonathan R Carapetis
Journal: PLoS Negl Trop Dis Date: 2016-03-04
