Svenn-Arne Dragly, Milad Hobbi Mobarhan, Mikkel E Lepperød, Simen Tennøe, Marianne Fyhn, Torkel Hafting, Anders Malthe-Sørenssen.
Abstract
Natural sciences generate an increasing amount of data in a wide range of formats developed by different research groups and commercial companies. At the same time, there is a growing desire to share data along with publications in order to enable reproducible research. Open formats have publicly available specifications, which facilitate data sharing and reproducible research. Hierarchical Data Format 5 (HDF5) is a popular open format widely used in neuroscience, often as a foundation for other, more specialized formats. However, drawbacks related to HDF5's complex specification have prompted a discussion about an improved replacement. We propose a novel alternative, the Experimental Directory Structure (Exdir), an open specification for data storage in experimental pipelines that addresses the drawbacks associated with HDF5 while retaining its advantages. HDF5 stores data and metadata in a hierarchy within a complex binary file which, among other things, is not human-readable, is not optimal for version control systems, and lacks support for easy access to raw data from external applications. Exdir, on the other hand, uses file system directories to represent the hierarchy, with metadata stored in human-readable YAML files, datasets stored in binary NumPy files, and raw data stored directly in subdirectories. Furthermore, storing data in multiple files makes it easier for version control systems to track changes. Exdir is not a file format in itself, but a specification for organizing files in a directory structure. Exdir uses the same abstractions as HDF5 and is compatible with the HDF5 Abstract Data Model. Several research groups are already using data stored in a directory hierarchy as an alternative to HDF5, but no common standard exists. This complicates and limits the opportunity for data sharing and development of common tools for reading, writing, and analyzing data.
Exdir facilitates improved data storage, data sharing, reproducible research, and novel insight from interdisciplinary collaboration. With the publication of Exdir, we invite the scientific community to join the development to create an open specification that will serve as many needs as possible and as a foundation for open access to and exchange of data.
Keywords: Python; analysis; data management; data storage; file format
Year: 2018; PMID: 29706879; PMCID: PMC5909058; DOI: 10.3389/fninf.2018.00016
Source DB: PubMed; Journal: Front Neuroinform; ISSN: 1662-5196; Impact factor: 4.081
Overview of commonly used open formats in neuroscience.
| Format | Based on HDF5 | Comments | Reference |
|---|---|---|---|
| NWB | Yes | | Teeters et al., |
| Kwik | Yes | | Kadir et al., |
| BRAINformat | Yes | | Rübel et al., |
| Open Ephys | Yes/No | Binary format specifically designed for electrophysiological data. HDF5 optional. | Siegle et al., |
| NeuroShare | No | API to access binary formats and a binary format specifically designed for electrophysiological data. | |
| Neo | N/A | In-memory data format for Python. | Garcia et al., |
| CARMEN NDF | (Yes) | Specifically designed for neuroscience. | |
| Nix | Yes | Adds a layer on top of the abstract data model that standardizes annotation of data. Directory-based backend in development. | Stoewer et al., |
| odML | No | Only applies to metadata. | Grewe et al., |
| NSDF | Yes | Format for neuroscience simulation data. | Ray et al., |
Figure 1. Overview of an example Exdir directory. File, Group, and Dataset refer to objects in Exdir, and are stored as directories in the file system. These objects are equivalent to the same objects in the HDF5 abstract data model. Raw is specific to Exdir and is a regular directory containing arbitrary data files. Inside each directory, there is a file named exdir.yaml with information about the object type and Exdir version. Each object may contain an attributes.yaml file containing user-defined attributes. Inside the Dataset directory is a file named data.npy that contains the data of the dataset stored in the NumPy binary format.
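The layout described in the Figure 1 caption can be sketched with plain directories and files. This is a minimal illustration and not the Exdir reference implementation: the key layout inside exdir.yaml, the directory names, and the attribute values are assumptions made for the example, and the YAML is written by hand to keep the sketch free of extra dependencies.

```python
import tempfile
from pathlib import Path

import numpy as np

# Assumed content of exdir.yaml: the caption only states that the file
# records the object type and the Exdir version, not its exact keys.
EXDIR_YAML = "exdir:\n  type: {kind}\n  version: 1\n"


def make_object(path: Path, kind: str, attributes: str = "") -> Path:
    """Create one object directory and write its metadata files."""
    path.mkdir(parents=True)
    (path / "exdir.yaml").write_text(EXDIR_YAML.format(kind=kind))
    if attributes:
        # attributes.yaml is optional and holds user-defined attributes.
        (path / "attributes.yaml").write_text(attributes)
    return path


def build_example(root: Path) -> Path:
    """Build a File holding one Group, one Dataset, and one Raw directory."""
    file_dir = make_object(root / "experiment.exdir", "file")
    group = make_object(file_dir / "session1", "group", "subject: example\n")
    dataset = make_object(group / "voltage", "dataset", "unit: mV\n")
    # The dataset values live in data.npy, in the NumPy binary format.
    np.save(dataset / "data.npy", np.arange(6, dtype=np.float64).reshape(2, 3))
    # Raw follows the caption's "each directory" wording and gets exdir.yaml too.
    raw = make_object(group / "raw_recordings", "raw")
    (raw / "notes.txt").write_text("arbitrary file stored as-is\n")
    return file_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        top = build_example(Path(tmp))
        for entry in sorted(top.rglob("*")):
            print(entry.relative_to(top.parent))
        print(np.load(top / "session1" / "voltage" / "data.npy"))
```

Because every piece is an ordinary file, the metadata can be inspected with a text editor and the array can be loaded with `numpy.load` without any Exdir-specific tooling.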
Exdir format structure.
| Object | Description |
|---|---|
| File | Root object |
| Group | Intermediate directory |
| Dataset | Data |
| Raw | Arbitrary data files |
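One way to read such a structure back is to walk the directory tree and look up the object type recorded in each exdir.yaml. The sketch below is hedged: the type strings (`file`, `group`, `dataset`, `raw`), their pairing with the four descriptions in the table above, and the key layout of exdir.yaml are assumptions, and the line-based parser is a stand-in for a real YAML library.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Descriptions from the table above, keyed by assumed type strings.
DESCRIPTIONS = {
    "file": "Root object",
    "group": "Intermediate directory",
    "dataset": "Data",
    "raw": "Arbitrary data files",
}


def read_type(directory: Path) -> Optional[str]:
    """Return the object type stored in exdir.yaml, or None if absent."""
    meta = directory / "exdir.yaml"
    if not meta.is_file():
        return None
    # Simplified parsing: look for a "type: value" line.
    for line in meta.read_text().splitlines():
        key, _, value = line.strip().partition(":")
        if key == "type":
            return value.strip().strip("\"'")
    return None


def describe(root: Path) -> Dict[str, str]:
    """Map every object directory under root (inclusive) to its type."""
    result = {}
    subdirs = sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in [root, *subdirs]:
        kind = read_type(directory)
        if kind is not None:
            result[directory.relative_to(root.parent).as_posix()] = kind
    return result


def make_demo(base: Path) -> Path:
    """Create a small hypothetical tree with one object of each type."""
    root = base / "demo.exdir"
    layout = [("", "file"), ("group1", "group"),
              ("group1/dataset1", "dataset"), ("group1/raw1", "raw")]
    for relative, kind in layout:
        directory = root / relative
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "exdir.yaml").write_text(
            f"exdir:\n  type: {kind}\n  version: 1\n")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path, kind in describe(make_demo(Path(tmp))).items():
            print(f"{path}: {kind} ({DESCRIPTIONS[kind]})")
```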
Figure 2. Exdir reference implementation class hierarchy.
Figure 3. Screenshot of the Exdir browser.
Results from benchmarks comparing performance in Exdir with h5py.
| Operation | h5py (Linux) | Exdir (Linux) | h5py (Windows) | Exdir (Windows) |
|---|---|---|---|---|
| Add 5 attributes | 0.002 s | 0.010 s | 0.002 s | 0.030 s |
| Add 200 attributes (one by one) | 0.066 s | 3.6 s | 0.068 s | 5.5 s |
| Add 200 attributes (single operation) | N/A | 0.030 s | N/A | 0.049 s |
| Add dataset with 10^6 64-bit floats | 0.009 s | 0.013 s | 0.019 s | 0.040 s |
| Add dataset with 10^8 64-bit floats | 0.38 s | 0.54 s | 4.4 s | 0.83 s |
| Create 5,000 groups (thorough validation) | 0.26 s | 8.1 s | 0.36 s | 8.9 s |
| Create 5,000 groups (minimal validation) | 0.26 s | 1.1 s | 0.36 s | 8.4 s |
| Create tree (3 groups × 5 levels) | 0.14 s | 0.34 s | 0.14 s | 2.1 s |
| Write 3D slice (100 × 300 × 100) | 0.0033 s | 0.0048 s | 0.031 s | 0.029 s |
A 2 GB RAM disk was used as a virtual hard drive for the tests. Filename validation was disabled for Exdir in all tests. Software used: Python 3.6, NumPy 1.13.1, Ubuntu 16.04, Windows 7. Hardware used: Linux: Intel Core i7-5820K 3.30 GHz, 32 GB RAM, Intel 535 512 GB SSD. Windows: HP EliteBook 8570p, Intel Core i7-3520M 2.90 GHz, 8 GB RAM, Samsung 128 GB SSD.
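To put the dataset rows in perspective, the arithmetic below converts element counts to bytes and timings to an approximate write throughput. It reads the dataset sizes in the table as 10^6 and 10^8 values (an interpretation of the flattened exponents) and takes the four timings of the larger dataset row in the table's column order; the function names are hypothetical.

```python
FLOAT64_BYTES = 8  # a 64-bit float occupies 8 bytes


def dataset_bytes(n_values: int) -> int:
    """Size in bytes of the raw values of a float64 dataset."""
    return n_values * FLOAT64_BYTES


def throughput_mb_per_s(n_values: int, seconds: float) -> float:
    """Approximate write throughput in MB/s (1 MB = 10^6 bytes)."""
    return dataset_bytes(n_values) / 1e6 / seconds


if __name__ == "__main__":
    large = 10**8
    ram_disk_bytes = 2 * 10**9  # the 2 GB RAM disk mentioned in the note

    print(f"large dataset: {dataset_bytes(large) / 1e9:.1f} GB")
    print(f"fits on RAM disk: {dataset_bytes(large) < ram_disk_bytes}")

    # Timings of the larger dataset row, in the table's column order.
    for seconds in (0.38, 0.54, 4.4, 0.83):
        rate = throughput_mb_per_s(large, seconds)
        print(f"{seconds:>5} s -> {rate:8.1f} MB/s")
```

The larger dataset amounts to 0.8 GB of raw values, so a single copy fits on the 2 GB RAM disk.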