Anne E Bras, Vincent H J van der Velden.
Abstract
When it comes to data storage, the field of flow cytometry is fairly standardized, thanks to the flow cytometry standard (FCS) file format. The structure of FCS files is described in the FCS specification. Software that strictly complies with the FCS specification is guaranteed to be interoperable (in terms of exchange via FCS files). Nowadays, software interoperability is crucial for the ecosystem, as FCS files are frequently shared and workflows rely on more than one piece of software (e.g., acquisition and analysis software). Ideally, software developers strictly follow the FCS specification. Unfortunately, this is not always the case, which has resulted in various nonconformant FCS files being generated over time. Therefore, robust FCS parsers must be developed that can handle a wide variety of nonconformant FCS files from different sources. Development of robust FCS parsers would greatly benefit from a fully fledged set of testing files. In this study, the readability of 211,359 public FCS files was evaluated. Each FCS file was checked for conformance with the FCS specification. For each data set within each FCS file, validated parse results were obtained for the TEXT segment. Highly space-efficient testing files were generated. FlowCore was benchmarked in depth, using the validated parse results, the generated testing files, and the original FCS files. The robustness of FlowCore (as measured by testing against 211,359 files) was improved by re-implementing the TEXT segment parser. Altogether, this study provides a comprehensive resource for FCS parser development, an in-depth benchmark of FlowCore, and a concrete proposal for improving FlowCore.
Keywords: FCS standard; bioinformatics; file format; file parser; flow cytometry
Year: 2020 PMID: 32633075 PMCID: PMC7754493 DOI: 10.1002/cyto.a.24187
Source DB: PubMed Journal: Cytometry A ISSN: 1552-4922 Impact factor: 4.355
Figure 1. A. In total, 211,359 publicly available FCS files were downloaded (RAW‐FCS), consisting of 206,620 single data set files (RAW‐SD‐FCS) and 4,739 multi data set files (RAW‐MD‐FCS). Each RAW‐FCS was checked for conformance with the FCS specification. For each RAW‐FCS, multiple artificial FCS files were created (details in B and C), allowing in-depth evaluation of various aspects of RAW‐FCS in a highly space-efficient manner. For each data set within each RAW‐FCS, validated parse results were obtained for the TEXT segment by aggregating FlowCore and FlowJo parse results. The metadata of 211,359 public FCS files can now be easily explored. Together, the validated parse results and artificial FCS files form a fully fledged testing set for FCS parser development. Performance of FlowCore was benchmarked in depth. B. SLICED‐FCS mimic RAW‐SD‐FCS: they are identical to RAW‐SD‐FCS, except that the bytes after the TEXT segment are removed. INIT‐FCS are newly created FCS files, which contain the initial TEXT segment from RAW‐FCS, preceded by a HEADER with accurate offsets. TEST‐FCS contain the first 20 events of the original DATA segment and a slightly modified version of the original TEXT segment (reflecting the new DATA segment size). C. BLANKED‐FCS mimic RAW‐MD‐FCS: they are identical to RAW‐MD‐FCS, except that the bytes of the DATA and ANALYSIS segments are replaced by 0x20 bytes (the space character in ASCII). ADD‐FCS are similar to INIT‐FCS, but contain the additional TEXT segment(s) from RAW‐MD‐FCS.
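The slicing idea behind SLICED‐FCS (keep everything up to the end of the TEXT segment, drop the rest) can be illustrated with a short Python sketch. This is not the authors' implementation. It assumes the standard FCS 3.x HEADER layout: a 6-byte version string, 4 spaces, then 8-byte right-aligned ASCII offsets, with the TEXT start in bytes 10-17 and the inclusive TEXT end in bytes 18-25. The function names are hypothetical, and the sketch does not handle files whose TEXT offsets are themselves inaccurate.

```python
from typing import Optional, Tuple

# Assumed HEADER layout: version (6 bytes) + spaces (4 bytes)
# + six 8-byte ASCII offsets (TEXT, DATA, ANALYSIS start/end).
HEADER_SIZE = 58


def parse_offset(field: bytes) -> Optional[int]:
    """Parse one 8-byte offset field; return None if the field is blank.

    int() accepts both space-padded and zero-padded digits.
    """
    text = field.decode("ascii").strip()
    return int(text) if text else None


def read_text_offsets(raw: bytes) -> Tuple[int, int]:
    """Return the (start, end) offsets of the TEXT segment; end is inclusive."""
    if len(raw) < HEADER_SIZE:
        raise ValueError("file is shorter than an FCS HEADER")
    start = parse_offset(raw[10:18])
    end = parse_offset(raw[18:26])
    if start is None or end is None:
        raise ValueError("TEXT offsets are missing from the HEADER")
    return start, end


def slice_after_text(raw: bytes) -> bytes:
    """Drop every byte after the TEXT segment (hypothetical helper)."""
    _, end = read_text_offsets(raw)
    return raw[: end + 1]
```

Because the DATA segment usually dominates file size, a file truncated this way keeps the HEADER and TEXT bytes exactly as they were written while occupying only a small fraction of the original space.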
Figure 2. HEADER and TEXT segments were checked against the FCS specification. A. Offsets were always right-aligned, with either leading spaces or leading zeros. Absence of ANALYSIS segments was indicated by empty offsets, negative offsets (incorrect), or both offsets being zero. B. Various invalid offset pairs were found, including invalid ranges (start > end), ranges out‐of‐range (end > size), and overlapping ranges (segments having bytes in common). C. Inaccurate TEXT offsets always resulted in additional bytes being included. D. These were mostly padding bytes and/or bytes from the adjacent segment. E. Various delimiters were used. F. Some TEXT segments featured double closing delimiters. G. Others featured no closing delimiter at all. H. Many empty values were identified, despite being strictly prohibited by the FCS specification. I. Escapes were found less frequently than empty values. J. FlowCore and FlowJo alternately misinterpreted the consecutive delimiters (empty values vs. escapes).
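The ambiguity in panel J arises because two consecutive delimiters can be read either as an escaped delimiter inside a keyword or value, or as an empty value between two delimiters. The Python sketch below contrasts the two readings of the same TEXT segment. It is an illustration only and reproduces neither FlowCore's nor FlowJo's parser. It assumes the first character of the segment is the delimiter, and the function names are hypothetical.

```python
from typing import Dict, List


def split_allow_empty(text: str) -> List[str]:
    """Read consecutive delimiters as an empty token (nonconformant files)."""
    delim = text[0]
    body = text[1:]
    if body.endswith(delim):
        body = body[:-1]  # drop a single closing delimiter
    return body.split(delim)


def split_with_escapes(text: str) -> List[str]:
    """Read a doubled delimiter as one literal delimiter (escape reading)."""
    delim = text[0]
    tokens: List[str] = []
    current: List[str] = []
    i, n = 1, len(text)
    while i < n:
        ch = text[i]
        if ch == delim:
            if i + 1 < n and text[i + 1] == delim:
                current.append(delim)  # escaped delimiter stays in the token
                i += 2
                continue
            tokens.append("".join(current))  # a single delimiter ends the token
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        tokens.append("".join(current))  # tolerate a missing closing delimiter
    return tokens


def to_pairs(tokens: List[str]) -> Dict[str, str]:
    """Pair tokens up as keyword/value; an unpaired trailing token is dropped."""
    return dict(zip(tokens[0::2], tokens[1::2]))


ambiguous = "/$CYT//$TOT/20/"
print(to_pairs(split_allow_empty(ambiguous)))   # {'$CYT': '', '$TOT': '20'}
print(to_pairs(split_with_escapes(ambiguous)))  # {'$CYT/$TOT': '20'}
```

Neither reading is right for every file: the first breaks values that contain an escaped delimiter, while the second merges a keyword with the following one whenever a value is empty. A robust parser therefore needs extra evidence, such as known keyword names, to choose between them.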