David Sehnal, Sebastian Bittrich, Sameer Velankar, Jaroslav Koča, Radka Svobodová, Stephen K Burley, Alexander S Rose.
Abstract
3D macromolecular structural data is growing ever more complex and plentiful in the wake of substantive advances in experimental and computational structure determination methods, including macromolecular crystallography, cryo-electron microscopy, and integrative methods. Efficient means of working with 3D macromolecular structural data for archiving, analyses, and visualization are central to facilitating interoperability and reusability in compliance with the FAIR Principles. We address two challenges posed by growth in data size and complexity. First, data size is reduced by bespoke compression techniques. Second, complexity is managed through improved software tooling and by fully leveraging available data dictionary schemas. To this end, we introduce BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas, such as PDBx/mmCIF, while reducing file sizes by more than a factor of two versus gzip-compressed CIF files. Moreover, for the largest structures, BinaryCIF provides even better compression: factors of ten and four versus CIF files and gzipped CIF files, respectively. Herein, we describe CIFTools, a set of libraries in Java and TypeScript for generic and typed handling of CIF and BinaryCIF files. Together, BinaryCIF and CIFTools enable lightweight, efficient, and extensible handling of 3D macromolecular structural data.
Year: 2020 PMID: 33075050 PMCID: PMC7595629 DOI: 10.1371/journal.pcbi.1008247
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Encoding types supported by the BinaryCIF format.
| Encoding | How it works | Useful for |
|---|---|---|
| Byte Array | Directly store data | Raw data that does not benefit from any encoding |
| Fixed Point | Multiply numeric value by a constant and store it as an integer | Floating point values where precision can be reduced (e.g. coordinate data) |
| Run Length | Store repeating numeric elements as a tuple with the value and the number of repeats | When combined with delta encoding, useful for storing linear identifiers |
| Delta | Instead of storing absolute values, store differences between consecutive elements | Linear identifiers & when combined with fixed point and integer packing, coordinate data |
| Interval Quantization | Store an interval quantized into 256 (8-bit) or 65536 (16-bit) uniformly distributed discrete steps (values are rounded to the closest step) | Experimental (density) data |
| Integer Packing | Represent large values using 8 or 16-bit numbers | Sequences of data where most values are small |
| String Array | Store an array of strings by concatenating all unique strings and representing each element as a pair of substring indices into the concatenated string. Effectively encodes repeating substrings. | All string data, particularly annotations such as residue names |
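
The Fixed Point, Delta, and Run Length rows above can be illustrated with a minimal TypeScript sketch. It is not the CIFTools implementation: all function names are hypothetical, the scaling factor of 1000 is an arbitrary choice, and the delta encoder simply keeps the first value in place, which a real codec may handle differently.

```typescript
// Fixed Point: scale by a constant and round to an integer.
// Precision beyond 1/factor is discarded.
function fixedPointEncode(values: number[], factor: number): number[] {
  return values.map((v) => Math.round(v * factor));
}

function fixedPointDecode(ints: number[], factor: number): number[] {
  return ints.map((v) => v / factor);
}

// Delta: keep the first value, then store differences between neighbours.
// (Assumption for this sketch: no separately stored origin.)
function deltaEncode(values: number[]): number[] {
  return values.map((v, i) => (i === 0 ? v : v - values[i - 1]));
}

function deltaDecode(deltas: number[]): number[] {
  const out: number[] = [];
  let current = 0;
  for (const d of deltas) {
    current += d;
    out.push(current);
  }
  return out;
}

// Run Length: flatten runs into [value, count, value, count, ...].
function runLengthEncode(values: number[]): number[] {
  const out: number[] = [];
  for (const v of values) {
    const last = out.length - 2; // index of the most recent run's value
    if (last >= 0 && out[last] === v) {
      out[last + 1] += 1;
    } else {
      out.push(v, 1);
    }
  }
  return out;
}

function runLengthDecode(pairs: number[]): number[] {
  const out: number[] = [];
  for (let i = 0; i + 1 < pairs.length; i += 2) {
    for (let n = 0; n < pairs[i + 1]; n++) out.push(pairs[i]);
  }
  return out;
}

// Linear identifiers: delta turns 1..6 into six ones, run length collapses them.
const atomIds = [1, 2, 3, 4, 5, 6];
const encodedIds = runLengthEncode(deltaEncode(atomIds)); // [1, 6]

// Coordinates: fixed point makes integers, delta makes them small.
const xCoords = [1.234, 1.267, 1.301];
const encodedX = deltaEncode(fixedPointEncode(xCoords, 1000)); // [1234, 33, 34]
```

Chaining the steps this way shows why the table recommends combinations: delta alone does not shrink the identifier column, but it produces the repeated values that run-length encoding collapses, and the small coordinate differences it produces are the kind of input integer packing is suited to.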
Fig 1. Compression strategies of BinaryCIF.
The BinaryCIF codec represents diverse data types in a standardized manner: The indices wherein particular strings occur together with float values can be encoded as integer values. Interval Quantization is the only lossy encoding. For integer arrays the most efficient combination of Run Length, Delta, and Integer Packing is detected. This approach allows management of arbitrary data and even columns that are not defined by any schema. MessagePack is employed downstream of BinaryCIF encoding.
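
Integer Packing, the last step in the integer pipeline described above, can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference codec: it handles only the signed 8-bit case, the names `packInt8` and `unpackInt8` are hypothetical, and it assumes that a value reaching the limit is written as a run of limit values followed by the remainder.

```typescript
const INT8_MAX = 127;
const INT8_MIN = -128;

// Pack integers into the signed 8-bit range. A value at or beyond a limit
// is emitted as repeated limit values followed by the remainder (possibly 0).
function packInt8(values: number[]): number[] {
  const out: number[] = [];
  for (let v of values) {
    if (v >= 0) {
      while (v >= INT8_MAX) {
        out.push(INT8_MAX);
        v -= INT8_MAX;
      }
    } else {
      while (v <= INT8_MIN) {
        out.push(INT8_MIN);
        v -= INT8_MIN;
      }
    }
    out.push(v);
  }
  return out;
}

// Unpack by summing limit values until a non-limit terminator is reached.
function unpackInt8(packed: number[]): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < packed.length) {
    let value = 0;
    while (
      i < packed.length - 1 &&
      (packed[i] === INT8_MAX || packed[i] === INT8_MIN)
    ) {
      value += packed[i];
      i++;
    }
    value += packed[i];
    i++;
    out.push(value);
  }
  return out;
}

const packed = packInt8([1, 2, -3, 128]); // [1, 2, -3, 127, 1]
const restored = unpackInt8(packed);      // [1, 2, -3, 128]
```

Small values cost one byte each and only the occasional large value costs more, which is why the encoding pays off for sequences where most values are small, such as delta-encoded columns.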
Fig 2. Archive sizes.
Archive sizes for 154,015 files are given in GB (see S1 Text). Original refers to the content of the original structure files. Pruned resembles the set of information provided by MMTF files (see S1 Table). Use of BinaryCIF yields an archive size similar to MMTF.
Fig 3. Large structures.
BinaryCIF provides the most effective compression for the largest structures, enumerated in S2 Table.
Fig 4. Read performance of JavaScript implementation.
Average single-threaded parsing time for 154,015 PDB structures is given in minutes. Reading of binary data (BinaryCIF and MMTF) can provide a dramatic speedup. Handling gzipped files slows down parsing in most cases. Read performance can be improved easily by omitting less frequently used meta-information, as seen for the pruned bins.
Fig 5. Read performance of Java implementation.
Average single-threaded parsing time for 154,015 PDB structures is given in minutes.