Hannes L Röst, Uwe Schmitt, Ruedi Aebersold, Lars Malmström.
Abstract
MOTIVATION: In mass spectrometry-based proteomics, XML formats such as mzML and mzXML provide an open and standardized way to store and exchange the raw data (spectra and chromatograms) of mass spectrometric experiments. These file formats are being used by a multitude of open-source and cross-platform tools which allow the proteomics community to access algorithms in a vendor-independent fashion and perform transparent and reproducible data analysis. Recent improvements in mass spectrometry instrumentation have increased the data size produced in a single LC-MS/MS measurement and put substantial strain on open-source tools, particularly those that are not equipped to deal with XML data files that reach dozens of gigabytes in size.
Year: 2015 PMID: 25927999 PMCID: PMC4416046 DOI: 10.1371/journal.pone.0125108
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Available access types of raw mass spectrometric data through OpenMS.
This paper describes the implementation and use of the “random access on disk” (indexed mzML and cached mzML) and the “event-driven” access types and compares them to the previously available “random access in memory” method.
| Access type | Operation | Use case |
|---|---|---|
| Random Access in Memory | Read | Fast access to small XML data structures |
| | Write | Fast storage of small XML data structures |
| Random Access on Disc (indexed) | Read | Access for infrequent random access |
| | Write | Only sequential writing possible (see below) |
| Random Access on Disc (cached) | Read | Ultra-fast access for frequent random access |
| | Write | Only sequential writing possible (see below) |
| Event-driven processing | Read | Sequential read access suitable for compressed or non-indexed files |
| | Write | Sequential write interface for mzML, indexed mzML, cached mzML |
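The idea behind random access on disc with sequential-only writing can be illustrated with a minimal sketch. This is not the OpenMS implementation and not the indexed or cached mzML file layout. It assumes a hypothetical flat binary layout (little-endian float64 m/z and intensity pairs) and a hypothetical offset index kept as a Python list. The function names `write_spectra` and `read_spectrum` are invented for illustration.

```python
import io
import struct

PEAK_BYTES = 16  # one peak = two float64 values (m/z, intensity); assumed layout


def write_spectra(stream, spectra):
    """Write spectra strictly in order and return an index of
    (byte offset, peak count) entries, one per spectrum."""
    index = []
    for peaks in spectra:
        index.append((stream.tell(), len(peaks)))
        for mz, intensity in peaks:
            stream.write(struct.pack("<dd", mz, intensity))
    return index


def read_spectrum(stream, index, i):
    """Seek directly to spectrum i and decode only its bytes,
    leaving every other spectrum untouched on disc."""
    offset, n_peaks = index[i]
    stream.seek(offset)
    raw = stream.read(PEAK_BYTES * n_peaks)
    values = struct.unpack("<%dd" % (2 * n_peaks), raw)
    return list(zip(values[0::2], values[1::2]))


# Toy data; io.BytesIO stands in for a file opened in binary mode.
spectra = [
    [(100.0, 10.0), (200.0, 20.0)],
    [(150.5, 5.0)],
    [(300.25, 1.0), (400.0, 2.0), (500.0, 3.0)],
]
buffer = io.BytesIO()
index = write_spectra(buffer, spectra)
third = read_spectrum(buffer, index, 2)  # random access to the last spectrum
```

In this sketch a read touches only the bytes of the requested spectrum, whereas the writer has to append spectra one after another because each offset depends on everything written before it.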
Fig 1. Processing time (s) for the different algorithms.
The time to process a single 6.9 GB file (containing 456 million peaks) is depicted for the different algorithms with and without multithreading enabled. The processing time using the ProteoWizard library is added for comparison. While our in-memory and our event-driven data access routines are substantially faster than data access using the OpenMS 1.11 code, the novel “cached” data format is an order of magnitude faster than all other data access methods. Note that some implementations, such as OpenMS 1.11 or ProteoWizard, are not capable of utilizing multiple threads. Comparisons were performed using the TICCalculator and files described in the text using a single thread or 16 threads for the multithreaded bar.
Fig 2. Execution times and memory requirements to calculate the total ion current (TIC) for the described data access implementations.
(a)-(b) Processing time normalized to the number of peaks processed per second for the different implementations using a single thread (a) or up to 16 threads (b). No error bars for “Cached” are depicted for graphical reasons (see S1 Text). For comparison purposes, the previous implementation (“Release 1.11”) and the ProteoWizard library are also depicted (both only support single-threaded XML parsing). The best performing implementation (“Cached”) provides a speed-up of more than 200-fold when using multiple threads. (c)-(d) Memory requirements (c) and execution times (d) as a function of the number of peaks for the different algorithms. While runtime scales linearly for all algorithms, the “event-driven” and “cached” algorithms have constant memory requirements. The implementations were allowed to use parallel processing (up to 16 threads). Comparisons were performed using the TICCalculator and files described in the text.
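The constant memory requirement of event-driven processing can be sketched with a toy total ion current calculation. The names below (`TICConsumer`, `consume_spectrum`, `spectrum_events`, `run_event_driven`) are hypothetical and are not the OpenMS or TICCalculator API. The generator stands in for a sequential parser that hands over one spectrum at a time.

```python
class TICConsumer:
    """Accumulates the total ion current without storing any spectrum."""

    def __init__(self):
        self.tic = 0.0
        self.n_peaks = 0

    def consume_spectrum(self, peaks):
        # Only two running totals are kept, so memory use does not
        # grow with the number of spectra seen so far.
        for _mz, intensity in peaks:
            self.tic += intensity
            self.n_peaks += 1


def spectrum_events(n_spectra, peaks_per_spectrum):
    """Hypothetical stand-in for a sequential file parser: yields one
    spectrum (a list of (m/z, intensity) pairs) at a time."""
    for i in range(n_spectra):
        yield [(100.0 + j, float(i + 1)) for j in range(peaks_per_spectrum)]


def run_event_driven(events, consumer):
    """Push each spectrum to the consumer as soon as it is produced."""
    for peaks in events:
        consumer.consume_spectrum(peaks)
    return consumer


# Four spectra with three peaks each; intensities are 1, 2, 3 and 4.
consumer = run_event_driven(spectrum_events(4, 3), TICConsumer())
```

Because each spectrum is discarded once `consume_spectrum` returns, runtime in this sketch grows with the number of peaks while memory stays bounded by the largest single spectrum.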