Miaoshan Lu, Shaowei An, Ruimin Wang, Jinyin Wang, Changbin Yu.
Abstract
BACKGROUND: As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The required data precision often depends on the instrument and on the subsequent processing algorithms. Unlike storage-oriented formats, which focus mainly on lossless compression rate, computation-oriented formats weigh decoding speed as heavily as compression rate.
Keywords: Aird; Compressor; DDA; DIA; Mass spectrometry; Metabolomics; Proteomics; ZDPD
Year: 2022 PMID: 35021987 PMCID: PMC8756627 DOI: 10.1186/s12859-021-04490-0
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A Traditional mzXML-like indexing: MS data is sorted by retention time, and the retention time and start position of each spectrum are stored as an index. B In DDA, each MS1 spectrum is followed by 0 to N MS2 spectra. Extracted ion chromatograms are generally calculated from successive MS1 spectra, so reordering the MS1 data into the same physical file block speeds up file reads. C In SWATH/DIA, every MS2 spectrum belongs to a particular m/z SWATH window. MS2 spectra in the same window are frequently accessed and processed together, so reordering is essential for fast file I/O
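The reordering idea in this caption can be illustrated with a minimal sketch. It is not the Aird implementation: the `Spectrum` class, the `build_blocks` function, and the index layout are hypothetical names chosen for illustration. The sketch groups all MS1 spectra into one contiguous block and all MS2 spectra of the same m/z window into another, and records byte offsets so that one block can be read sequentially.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Spectrum:
    """Hypothetical in-memory spectrum record (not the Aird data model)."""
    rt: float                                     # retention time
    ms_level: int                                 # 1 for MS1, 2 for MS2
    window: Optional[Tuple[float, float]] = None  # precursor m/z window (MS2 only)
    payload: bytes = b""                          # already-encoded peak data


def build_blocks(spectra: List[Spectrum]) -> Tuple[bytes, Dict[tuple, dict]]:
    """Reorder spectra so each block holds the MS1 stream or one m/z window."""
    groups: Dict[tuple, List[Spectrum]] = defaultdict(list)
    for s in spectra:
        key = ("MS1",) if s.ms_level == 1 else ("MS2", s.window)
        groups[key].append(s)

    out = bytearray()
    index: Dict[tuple, dict] = {}
    for key, members in groups.items():
        members.sort(key=lambda s: s.rt)          # keep retention-time order inside a block
        block_start = len(out)
        rts, offsets = [], []
        for s in members:
            rts.append(s.rt)
            offsets.append(len(out))              # byte offset of this spectrum
            out += s.payload
        index[key] = {"start": block_start, "end": len(out),
                      "rts": rts, "offsets": offsets}
    return bytes(out), index
```

With such a layout, an extracted ion chromatogram over MS1 data only needs the byte range `index[("MS1",)]["start"]` to `index[("MS1",)]["end"]`, instead of seeking past interleaved MS2 spectra.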
Fig. 3 A Size of metadata with the same content stored in JSON and in XML, for files generated by Thermo QE and SCIEX TOF instruments. File size increased by about 65% when converted from JSON to XML for SCIEX TOF files and 1.7 times for Thermo QE files. B The principle of the ZDPD algorithm. C The compressed size of each algorithm compared with Zlib. The ratio increases as the accuracy improves. The ZDPD algorithm shows the best performance. D The encoding time for each algorithm. Owing to SIMD support, the FastPfor algorithm is very fast. E The decoding time for each algorithm. Decoding time is one of the most critical parameters in computation-oriented processing. Although ZDPD is not the fastest, its performance is the most balanced when compression rate and decoding speed are considered together
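The caption does not spell out the steps of ZDPD, so the following is not ZDPD itself. It is a generic sketch of near-lossless compression at a chosen precision, under the assumption that a sorted m/z array is converted to fixed-precision integers, delta-encoded, and then passed to zlib. The function names `encode_mz` and `decode_mz` and the `decimals` parameter are hypothetical, and no integer-packing stage such as FastPfor is included.

```python
import struct
import zlib
from itertools import accumulate
from typing import List, Sequence


def encode_mz(mz: Sequence[float], decimals: int = 4) -> bytes:
    """Quantize m/z values to `decimals` places, delta-encode, then zlib-compress."""
    scale = 10 ** decimals
    ints = [round(v * scale) for v in mz]          # lossy step: fixed precision
    deltas: List[int] = []
    if ints:
        # First value is stored as-is, the rest as differences to the previous one.
        deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    raw = struct.pack(f"<{len(deltas)}i", *deltas)  # little-endian 32-bit integers
    return zlib.compress(raw)


def decode_mz(blob: bytes, decimals: int = 4) -> List[float]:
    """Invert encode_mz: decompress, undo the deltas, rescale to floats."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 4}i", raw)
    scale = 10 ** decimals
    return [v / scale for v in accumulate(deltas)]  # running sum restores the integers
```

Because m/z values within a spectrum are sorted, the deltas are small non-negative integers that compress better than the raw floating-point values. Raising `decimals` keeps more precision but makes the deltas larger, which is one plausible reading of why compressed size depends on the accuracy setting.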
Fig. 2 The conversion workflow of AirdPro. The figure shows how AirdPro separates the vendor files and compresses the m/z array and intensity array
Fig. 4 Aird file sizes at different precisions compared with the corresponding vendor files. Files were randomly selected from three manufacturers for comparison
Fig. 5 Comparison of file size between five formats in seven datasets. Dataset1 and Dataset2 contain both DDA and DIA files. The size of five datasets in DDA mode: the average size of Aird files is 2.0 GB, which is 54% of RAW files. The size of four datasets in DIA mode: the average size of Aird files is 24.6 GB, which is 77% of RAW files. Aird therefore has the highest compression rate and takes the least storage space among the five formats
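The two averages in this caption imply approximate RAW sizes, which can be checked with simple arithmetic. The helper `implied_raw_size` below is a hypothetical name, and the results are only as precise as the rounded figures quoted in the caption (2.0 GB at 54% and 24.6 GB at 77%).

```python
def implied_raw_size(aird_gb: float, fraction_of_raw: float) -> float:
    """Average RAW size implied by an Aird size and its fraction of the RAW size."""
    return aird_gb / fraction_of_raw


# Figures quoted in the caption (rounded values, so the results are approximate).
first_raw = implied_raw_size(2.0, 0.54)    # first group: 2.0 GB is 54% of RAW
second_raw = implied_raw_size(24.6, 0.77)  # second group: 24.6 GB is 77% of RAW

print(f"implied average RAW size, first group:  {first_raw:.1f} GB")
print(f"implied average RAW size, second group: {second_raw:.1f} GB")
```

Under these figures the relative saving is larger for the first group (about 46% smaller than RAW) than for the second (about 23% smaller).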
The detailed description for each dataset
| Datasets | Mode | Type | Instrument | Data ID |
|---|---|---|---|---|
| DDA-Dataset1 | DDA | Proteomics | TripleTOF 5600 | PXD021390 |
| DIA-Dataset1 | DIA | Proteomics | TripleTOF 5600 | PXD021390 |
| DDA-Dataset2 | DDA | Proteomics | Q-Exactive | PXD018139 |
| DIA-Dataset2 | DIA | Proteomics | Q-Exactive | PXD018139 |
| DDA-Dataset3 | DDA | Proteomics | Q-Exactive | IPX0001509000 |
| DDA-Dataset4 | DDA | Metabolomics | Q-Exactive | MTBLS2119 |
| DDA-Dataset5 | DDA | Metabolomics | TripleTOF 6600 | XCMS1197236 |
| DIA-Dataset6 | DIA | Proteomics | TripleTOF 5600, TripleTOF 6600 | PXD002952 |
| DIA-Dataset7 | DIA | Proteomics | Agilent 6550 | PXD004712 |
The PXD and IPX datasets can be searched on https://www.iprox.cn/, the MTBLS datasets can be found on https://www.ebi.ac.uk/metabolights/, and the XCMS datasets can be searched on https://xcmsonline.scripps.edu/