William T Katz, Stephen M Plaza.
Abstract
Open-source software development has skyrocketed in part due to community tools like github.com, which allows publication of code as well as the ability to create branches and push accepted modifications back to the original repository. As the number and size of EM-based datasets increase, the connectomics community faces similar issues when we publish snapshot data corresponding to a publication. Ideally, there would be a mechanism where remote collaborators could modify branches of the data and then flexibly reintegrate results via moderated acceptance of changes. The DVID system provides a web-based connectomics API and the first steps toward such a distributed versioning approach to EM-based connectomics datasets. Through its use as the central data resource for Janelia's FlyEM team, we have integrated the concepts of distributed versioning into reconstruction workflows, allowing support for proofreader training and segmentation experiments through branched, versioned data. DVID also supports persistence to a variety of storage systems, from high-speed local SSDs to cloud-based object stores, which allows its deployment on laptops as well as large servers. The tailoring of the backend storage to each type of connectomics data leads to efficient storage and fast queries. DVID is freely available as open-source software with an increasing number of supported storage options.
Keywords: EM reconstruction; big data; collaboration; connectomics; dataservice; datastore; distributed version control; versioning
Year: 2019 PMID: 30804760 PMCID: PMC6371063 DOI: 10.3389/fncir.2019.00005
Source DB: PubMed Journal: Front Neural Circuits ISSN: 1662-5110 Impact factor: 3.492
Figure 1. Key-value stores are among the simplest databases with few operations. Because of their simplicity, many storage systems can be mapped to key-value interfaces, including file systems where the file path is the key and the value is the file data.
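The file-system mapping described in the Figure 1 caption can be sketched in a few lines of Python. This is only an illustration of the idea, not DVID's storage code: the `FileKVStore` class, its method names, and the example key are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


class FileKVStore:
    """Toy key-value store: the key is a relative file path, the value is the file's bytes."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, key: str) -> Path:
        # The key is interpreted directly as a path below the root directory.
        return self.root / key

    def put(self, key: str, value: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(value)

    def get(self, key: str) -> Optional[bytes]:
        path = self._path(key)
        return path.read_bytes() if path.is_file() else None

    def delete(self, key: str) -> None:
        self._path(key).unlink(missing_ok=True)  # missing_ok requires Python 3.8+


def demo() -> Tuple[Optional[bytes], Optional[bytes]]:
    """Store, read back, and delete one value in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        store = FileKVStore(Path(tmp))
        store.put("grayscale/23_23_10", b"\x00\x01\x02")  # hypothetical key
        found = store.get("grayscale/23_23_10")
        store.delete("grayscale/23_23_10")
        missing = store.get("grayscale/23_23_10")
        return found, missing
```

The whole interface is `put`, `get`, and `delete`, which is why so many different backends can sit behind the same small set of operations.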
Sample of science HTTP API.

| Endpoint | Description |
| --- | --- |
| … | … 128³ voxel subvolume at offset (0, 0, 0). |
| … | … for blocks (23, 23, 10) and (23, 24, 10). |
| … | … at voxel (100, 100, 47). |
| … | … voxels in label 3171. |
| … | … of voxels with … |
| … | … JSON array … |
| … | … sparse volume. |
| … | … POSTed JSON. |
| … | … 200³ voxel subvolume at offset (0, 0, 0). |
| … | … (38, 21, 33) to (46, 23, 35). |
| … | … with label 3171. |
| … | … data with key “… |
| … | … key-value data using protobuf serialization. |
| … | … keys given in the query body as a JSON string array. |

Each datatype implements its own HTTP endpoints although similar datatypes (e.g., ones dealing with image volumes) can reuse interfaces like the first “raw” endpoint.
Figure 2. High-level view of DVID. Data types within DVID provide a Science API to clients while transforming data to meet a primarily key-value Storage API or proxy data to a connectomics service.
Figure 3. Versioning can help train proofreaders without requiring any changes to proofreading tools. After full proofreading (version 8d65f), an interesting neuron is selected and its precursor at the root version c78a0 is assigned for training. Each trainee gets her own branch off the root version, and the reconstructed neuron (e.g., the one depicted in training version a6341) can be compared to version 8d65f.
Figure 4. The version DAG of the mushroom body reconstruction as seen through the DVID Console's DAG viewer. Snapshots show (A) a zoomed-out view showing the extent of the DAG, with significant proofreader training branches near the root, and (B) a blown-up view of the leaf at bottom left. Green nodes highlight the “master” branch while the yellow leaf node is the current production version.
Figure 5. Each data type persists data using datatype-specific key-value pairs. Key-value pairs for two data instances are shown: a labelmap instance (data id 1) in blue and an annotation instance (data id 2) in red. The datatype-specific component of a key (TKey) could be a block coordinate for a block of voxels. DVID then wraps this TKey, prepending a short data instance identifier and appending a version identifier. A tombstone flag (T) can mark a key-value as deleted in a version without actually deleting earlier versions, as shown for the last key, which marks the deletion of annotations in block coordinate (23, 23, 10) in version 1.
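The key layout in the Figure 5 caption (instance identifier, then TKey, then version identifier and tombstone flag) can be illustrated with a minimal sketch. The byte widths, field order within the suffix, and the function names `block_tkey`, `wrap_key`, and `unwrap_key` are assumptions made for illustration; the caption does not specify DVID's actual binary encoding.

```python
import struct
from typing import Tuple


def block_tkey(x: int, y: int, z: int) -> bytes:
    """Hypothetical TKey: a block coordinate as three big-endian signed 32-bit ints."""
    return struct.pack(">iii", x, y, z)


def wrap_key(instance_id: int, tkey: bytes, version_id: int, tombstone: bool = False) -> bytes:
    """Prepend the data instance id; append the version id and a one-byte tombstone flag."""
    prefix = struct.pack(">I", instance_id)                  # assumed 4-byte instance id
    suffix = struct.pack(">IB", version_id, int(tombstone))  # assumed 4-byte version + flag
    return prefix + tkey + suffix


def unwrap_key(key: bytes) -> Tuple[int, bytes, int, bool]:
    """Split a wrapped key back into (instance id, TKey, version id, tombstone)."""
    instance_id = struct.unpack(">I", key[:4])[0]
    version_id, flag = struct.unpack(">IB", key[-5:])
    return instance_id, key[4:-5], version_id, bool(flag)
```

For the last key in the figure, `wrap_key(2, block_tkey(23, 23, 10), 1, tombstone=True)` would mark the annotations of block (23, 23, 10) as deleted in version 1 while leaving the keys of earlier versions in place. Because the instance identifier comes first, all keys of one data instance share a common prefix.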
Figure 6. Simple example of distribution of key-value pairs across the nodes of a DAG (only keys shown). In this example, segmentation and synapse data for a 6,400³ voxel volume with 1,000 labels is stored in labelmap (blue) and annotation (red) instances at the root version 8fc4. The majority of key-value pairs are ingested at the root and only modified key-value pairs need to be stored for later versions. Several mutation requests are shown with their modified key-value pairs.
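One way to read the Figure 6 caption is that a lookup at a later version falls back to ancestor versions when a key was not modified there. The sketch below shows that idea under simplifying assumptions: each version has a single parent (no merges), everything lives in memory, and the `VersionedKV` class, the child version name `child`, and the key strings are hypothetical. It is not DVID's implementation.

```python
from typing import Dict, Optional, Tuple

TOMBSTONE = object()  # sentinel marking a key as deleted in a given version


class VersionedKV:
    """Toy versioned store: each version keeps only the pairs it modified."""

    def __init__(self) -> None:
        self.parent: Dict[str, Optional[str]] = {}
        self.data: Dict[Tuple[str, str], object] = {}  # (key, version) -> value or TOMBSTONE

    def add_version(self, version: str, parent: Optional[str] = None) -> None:
        self.parent[version] = parent

    def put(self, version: str, key: str, value: bytes) -> None:
        self.data[(key, version)] = value

    def delete(self, version: str, key: str) -> None:
        # Record the deletion in this version only; ancestors keep their values.
        self.data[(key, version)] = TOMBSTONE

    def get(self, version: str, key: str) -> Optional[bytes]:
        # Walk from the requested version toward the root until the key is found.
        current: Optional[str] = version
        while current is not None:
            if (key, current) in self.data:
                value = self.data[(key, current)]
                return None if value is TOMBSTONE else value  # type: ignore[return-value]
            current = self.parent[current]
        return None


def demo() -> VersionedKV:
    store = VersionedKV()
    store.add_version("8fc4")                  # root version from the caption
    store.add_version("child", parent="8fc4")  # hypothetical later version
    store.put("8fc4", "block/0_0_0", b"ingested")
    store.put("8fc4", "block/23_23_10", b"root")
    store.put("8fc4", "block/23_24_10", b"root2")
    store.put("child", "block/23_23_10", b"edited")
    store.delete("child", "block/23_24_10")
    return store
```

In this sketch the child version stores two entries, while the untouched block is still readable from it through the root.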
Figure 7. Scalability of uncompressed grayscale image reads from Google Cloud Store backend. As the number of DVID servers increases, simultaneously requesting non-overlapping image subvolumes from a 16 TeraVoxel dataset, the throughput plateaus just below 1.2 Gigavoxels or 9.6 Gigabits per second. Servers were at the Janelia cluster with 16 real request threads per server, connecting to a Northern Virginia Google Cloud Store through a 10 Gigabits per second connection. The grayscale instance had only one version corresponding to the ingested image (8-bit/voxel) volume.
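The two throughput figures in the Figure 7 caption are the same quantity in different units, as a short conversion using the stated 8 bits per voxel shows:

```latex
1.2 \times 10^{9}\ \tfrac{\text{voxels}}{\text{s}} \times 8\ \tfrac{\text{bits}}{\text{voxel}}
  = 9.6 \times 10^{9}\ \tfrac{\text{bits}}{\text{s}}
  = 9.6\ \text{Gbit/s},
\qquad
\frac{9.6\ \text{Gbit/s}}{10\ \text{Gbit/s}} = 0.96
```

The plateau therefore sits at roughly 96% of the 10 Gigabits per second connection, which is consistent with the network link, rather than the servers, limiting throughput, although the caption itself does not state the cause.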
Figure 8. Typical EM reconstructions produce a version DAG with most changes toward the root and fewer, human-guided changes toward the leaf nodes. This means that the bulk of data will be committed and immutable.
Figure 9. As shown by software version control systems like git, distributed versioning is an effective workflow for sharing changes via pull requests. The figure depicts a future scenario where the root version at Janelia has been shared with remote collaborators. After changes at the remote site, a pull request is sent back.