Alessandro Bria, Massimo Bernaschi, Massimiliano Guarrasi, Giulio Iannello.
Abstract
Due to the limited field of view of microscopes, acquisitions of macroscopic specimens require many parallel image stacks to cover the whole volume of interest. Overlapping regions are introduced among stacks to make automatic alignment possible by means of a 3D stitching tool. Since state-of-the-art microscopes coupled with chemical clearing procedures can generate 3D images whose size exceeds one Terabyte, parallelization is required to keep stitching time within acceptable limits. In the present paper we discuss how multi-level parallelization reduces the execution times of TeraStitcher, a tool designed to deal with very large images. Two algorithms that partition the dataset for efficient parallelization in a transparent way are presented, together with experimental results proving the effectiveness of the approach, which achieves a speedup close to 300× when both coarse- and fine-grained parallelism are exploited. Multi-level parallelization of TeraStitcher led to a significant reduction of processing times with no changes in the user interface, and with no additional effort required for the maintenance of the code.
Keywords: 3D microscopy; GPU; data partitioning; parallel processing; stitching; terabyte images
Year: 2019 PMID: 31214007 PMCID: PMC6558144 DOI: 10.3389/fninf.2019.00041
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. An NCC map is computed for homologous projections (MIPs) of the two overlapping (blue and red) regions of adjacent tiles.
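The computation sketched in Figure 1 can be illustrated with a minimal NumPy snippet: an NCC value is evaluated for every integer displacement of one MIP with respect to the other, producing a small correlation map whose peak indicates the best alignment. This is a simplified sketch, not the TeraStitcher implementation; the function names and the `max_shift` search radius are hypothetical.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally-shaped patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def ncc_map(mip_a, mip_b, max_shift=3):
    """NCC map over integer displacements of mip_b w.r.t. mip_a."""
    h, w = mip_a.shape
    size = 2 * max_shift + 1
    out = np.zeros((size, size))
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Overlapping windows of the two projections for this shift.
            ay0, ay1 = max(0, dy), min(h, h + dy)
            ax0, ax1 = max(0, dx), min(w, w + dx)
            by0, by1 = max(0, -dy), min(h, h - dy)
            bx0, bx1 = max(0, -dx), min(w, w - dx)
            out[dy + max_shift, dx + max_shift] = ncc(
                mip_a[ay0:ay1, ax0:ax1], mip_b[by0:by1, bx0:bx1])
    return out
```

For two identical MIPs the map peaks (value 1.0) at zero displacement, i.e., at the center of the map.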
Figure 2. The tile matrix is partitioned in such a way that all alignments among adjacent tiles can be computed exactly once.
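The property stated in Figure 2 can be illustrated by making each pair of adjacent tiles "owned" by its top/left tile, so that enumerating owners yields every alignment exactly once. The sketch below is only an illustration of that invariant; the round-robin split is a placeholder, not the actual partitioning strategy of the paper.

```python
def adjacent_pairs(rows, cols):
    """All unordered pairs of adjacent tiles in a rows x cols matrix.
    Each pair is emitted once, at its top/left tile."""
    pairs = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                pairs.append(((r, c), (r, c + 1)))  # east neighbor
            if r + 1 < rows:
                pairs.append(((r, c), (r + 1, c)))  # south neighbor
    return pairs

def partition(pairs, workers):
    """Round-robin split of the pair list among workers (illustrative only)."""
    return [pairs[i::workers] for i in range(workers)]
```

For a 3 × 3 matrix this yields 12 alignments (6 horizontal, 6 vertical), each assigned to exactly one worker.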
Figure 3. The solution of the algorithm lies in the dashed area between the hyperbola y = B/x and the line y = (c0 + nm − nx)/m.
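The feasible region of Figure 3 can be explored numerically. The snippet below enumerates integer points (x, y) lying between the hyperbola y = B/x and the line y = (c0 + nm − nx)/m; the roles of B, n, m, and c0, and which curve bounds from below, are assumptions made for illustration, not taken from the paper's definitions.

```python
import math

def feasible_points(B, n, m, c0, x_max):
    """Integer (x, y) with B/x <= y <= (c0 + n*m - n*x)/m.
    Which inequality bounds from below is an assumption."""
    points = []
    for x in range(1, x_max + 1):
        lo = B / x                       # hyperbola y = B/x
        hi = (c0 + n * m - n * x) / m    # line y = (c0 + nm - nx)/m
        for y in range(max(1, math.ceil(lo)), math.floor(hi) + 1):
            points.append((x, y))
    return points
```

With toy values (B = 6, n = m = 2, c0 = 10), the region between the two curves contains a handful of integer points, e.g., (1, 6) and (2, 5).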
Datasets characteristics, parameters of the alignment algorithm, sequential execution times, and speedups attainable with 2, 4, and 8 processors.
| # | Dataset | Size (GB) | Tile matrix | Tile size (px) | Slices | Search range XY (px) | Search range Z (slices) | Seq. time (s) | Speedup (2 pr.) | Speedup (4 pr.) | Speedup (8 pr.) |
| 1 | Whole brain | 223 | 3 × 3 | 2,048 × 2,048 | 2959 | 65 × 65 | 15 | 25680 | 1.94 | 3.69 | 6.58 |
| 2 | Whole brain | 223 | 3 × 3 | 2,048 × 2,048 | 2950 | 65 × 65 | 30 | 38880 | 1.89 | 3.48 | 6.00 |
| 3 | Whole brain | 223 | 3 × 3 | 2,048 × 2,048 | 2950 | 90 × 90 | 15 | 39060 | 1.88 | 3.54 | 6.45 |
| 4 | Hippocampus | 21 | 45 × 44 | 768 × 768 | 6 | 90 × 90 | 1 | 74820 | 1.96 | 3.75 | 7.03 |
| 5 | Cerebellum | 39 | 4 × 10 | 512 × 512 | 3701 | 60 × 60 | 19 | 14340 | 1.86 | 3.59 | 5.55 |
| 6 | 2-photon | 2.65 | 6 × 9 | 1,568 × 1,568 | 10 | 58 × 33 | 1 | 524 | 1.87 | 3.20 | 5.63 |
Performance of experiment #1 measured on the IBM server.
| Processors | Time (s) | Speedup |
| 1 | 15300 | 1.00 |
| 2 | 8191 | 1.87 |
| 4 | 4100 | 3.73 |
| 8 | 2078 | 7.36 |
| 16 | 1168 | 13.10 |
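From the completion times above, speedup and parallel efficiency follow directly; a quick Python check using the table's numbers:

```python
# Completion times (s) of experiment #1 on the IBM server, from the table.
times = {1: 15300, 2: 8191, 4: 4100, 8: 2078, 16: 1168}
seq = times[1]

speedup = {p: seq / t for p, t in times.items()}        # T(1) / T(p)
efficiency = {p: speedup[p] / p for p in times}         # speedup / processors
# e.g., 8 processors: speedup ~ 7.36, efficiency ~ 0.92
```

Efficiency stays above 80% up to 16 processors, which matches the near-linear scaling reported in the table.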
Speedups obtained using the GPU (Tesla K20c) available on the HP workstation.
| 1 | Whole brain | 25680 | 15.85 | 19.45 |
| 2 | Whole brain | 38880 | 24.00 | 23.14 |
| 3 | Whole brain | 39060 | 14.47 | 25.04 |
| 4 | Hippocampus | 74820 | 31.93 | 30.76 |
| 5 | Cerebellum | 14340 | 23.04 | 28.65 |
| 6 | 2-photon | 524 | 23.82 | 27.58 |
The completion times are reported in seconds.
Performance of experiment #1 using up to 4 GPUs on the IBM server.
| Processes | GPUs | Time (s) | Absolute speedup | Relative speedup |
| 1 | 1 | 580 | 26.38 | 1.00 |
| 2 | 2 | 344 | 44.48 | 1.69 |
| 4 | 4 | 174 | 87.93 | 3.33 |
| 8 | 4 | 93 | 164.52 | 6.24 |
| 16 | 4 | 56 | 273.21 | 10.36 |
Both absolute (i.e., wrt the sequential execution without the GPU) and relative (i.e., wrt the sequential execution with GPUs) speedups are reported.
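The two speedup definitions can be reproduced from the table; the sequential CPU time of 15,300 s is taken from the previous experiment on the same server. A quick Python check:

```python
seq_cpu = 15300                                       # sequential time without GPU (s)
gpu_times = {1: 580, 2: 344, 4: 174, 8: 93, 16: 56}   # processes -> time with GPUs (s)

absolute = {p: seq_cpu / t for p, t in gpu_times.items()}       # wrt CPU-only sequential run
relative = {p: gpu_times[1] / t for p, t in gpu_times.items()}  # wrt 1-process GPU run
# e.g., 16 processes: absolute ~ 273.21x, relative ~ 10.36x
```

The absolute figure combines both levels of parallelism (multi-process plus GPU), while the relative figure isolates the multi-process scaling alone.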
Datasets characteristics, parameters of the fusion algorithm, sequential execution times, and speedups attainable with 2, 4, 6, 8, 10, and 12 processors.
| # | Dataset | Tile size (px) | Slices | Overlap (%) | Block size (voxels) | Seq. time (s) | Speedup (2 pr.) | Speedup (4 pr.) | Speedup (6 pr.) | Speedup (8 pr.) | Speedup (10 pr.) | Speedup (12 pr.) |
| 1 | Whole brain | 2048 × 2048 | 2959 | 25 | 768 × 768 × 256 | 7470 | 1.81 | 2.25 | 2.29 | 2.05 | n.a. | n.a. |
| 2 | Cerebellum | 512 × 512 | 3701 | 27 | 384 × 384 × 384 | 1886 | 1.81 | 3.18 | 4.76 | 5.10 | 6.83 | 7.23 |
| 3 | Hippocampus | 768 × 768 | 6 | 24 | 512 × 512 × 6 | 760 | 1.82 | 3.29 | 4.22 | 4.47 | 5.20 | 6.08 |
| 4 | 2-photon | 1568 × 1568 | 10 | 7/4 | 768 × 768 × 10 | 75 | 1.92 | 3.13 | 4.17 | 5.00 | 6.25 | 5.36 |
In the “2-photon” dataset the percentage of overlap is different along X and Y dimensions.
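The fusion step partitions the stitched volume into independent output blocks that can be produced in parallel. A minimal sketch of the resulting block count for the "Cerebellum" row; the stitched extent below ignores tile overlap and is an assumption for illustration only.

```python
import math

def block_grid(volume_dims, block_dims):
    """Blocks along each axis (ceiling division); border blocks may be smaller."""
    return tuple(math.ceil(v / b) for v, b in zip(volume_dims, block_dims))

# "Cerebellum": 4 x 10 tiles of 512 x 512 px, 3701 slices, fused into
# 384 x 384 x 384 blocks. Ignoring overlap, the stitched extent is roughly:
extent = (4 * 512, 10 * 512, 3701)        # (Y, X, Z), an approximation
grid = block_grid(extent, (384, 384, 384))
```

With these assumed dimensions the volume splits into a 6 × 14 × 10 grid, i.e., 840 blocks to distribute among the fusion processes.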
Comparison in pairwise displacement computation performance between ParaStitcher and BigStitcher.
| Dataset | ParaStitcher CPU (s) | ParaStitcher GPU (s) | BigStitcher (s) |
| Whole brain | 3903 | 1320 | n.a. |
| Hippocampus | 10643 | 2432 | n.a. |
| Cerebellum | 2584 | 501 | 1699 |
| 2-photon | 93 | 19 | 38 |
All times are measured on the HP workstation and are in seconds.
Comparison in image fusion performance between ParaStitcher and BigStitcher.
| Cerebellum | 3814 | 242 | 261 | 3576 |
| 2-photon | 28 | 10 | 10 | 425 |
All formats, except the single 3D TIFF, are compressed and include multiple resolutions.
All times are measured on the HP workstation and are in seconds.
Comparison between ParaStitcher and BigStitcher (Beta-version) on stitching parallelization.
| Feature | ParaStitcher | BigStitcher |
| Input arrangement | Regular grid of tiles (row-by-row), sparse regular grid by XML specification | Regular grid of tiles (various arrangements), non-regular grid |
| Input format | TIFF (multipage or 2D series), Bitplane Imaris, Hamamatsu DCIMG, OpenCV formats | TIFF (multipage only), Bioformats, Zeiss Lightsheet Z.1, MicroManager diSPIM |
| Multi-channel | Yes | Yes |
| Alignment parallelization | CPU (multi-process) + GPU (CUDA) | CPU (multi-thread), not controllable by the user |
| Fusion parallelization | CPU (multi-process) | No |
| Memory usage | < 0.02 × #procs × the size of the input dataset | ≈ 1.5 × the size of the input dataset |
| Output format | Compressed/uncompressed 8/16 bits 2D TIFF series, Compressed/uncompressed 8/16 bits multipage TIFF | n.a. |
| Output arrangement | Regular grid of tiles (row-by-row), Series of whole 2D slices | n.a. |
Performance was measured on the “Cerebellum” and “2-photon” datasets, whereas on the other two datasets (“Whole brain” and “Hippocampus”) BigStitcher did not complete the alignment step.
Only the formats/arrangements that can be effectively generated in parallel by the tools are reported; sequentially written formats/arrangements have been omitted.