Fan Zhu, David Rodriguez Gonzalez, Trevor Carpenter, Malcolm Atkinson, Joanna Wardlaw.
Year: 2012 PMID: 22824549 PMCID: PMC3778744 DOI: 10.1016/j.cmpb.2012.06.004
Source DB: PubMed Journal: Comput Methods Programs Biomed ISSN: 0169-2607 Impact factor: 5.428
Fig. 1 CUDA data and control flow. This figure shows the four steps of a CUDA data and control flow.
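In the canonical CUDA flow those four steps are: copy input data from host (CPU) memory to device (GPU) memory, launch the kernel from the host, execute the kernel in parallel on the GPU, and copy the results back to host memory. A minimal host-side sketch of that flow (array name, size, and kernel body are illustrative, not taken from the paper):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Illustrative kernel: one thread per element. */
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes), *d;
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); /* step 1: host -> device */
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);     /* steps 2-3: launch, parallel execution */
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); /* step 4: device -> host */

    printf("h[0] = %f\n", h[0]);
    cudaFree(d);
    free(h);
    return 0;
}
```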
Computation time for SVD (in seconds).
| Matrix size | MATLAB | MKL | GPGPU |
|---|---|---|---|
| 64 × 64 | 0.01 | 0.003 | 0.054 |
| 128 × 128 | 0.03 | 0.014 | 0.077 |
| 256 × 256 | 0.210 | 0.082 | 0.265 |
| 1K × 1K | 72 | 11.255 | 3.725 |
| 2K × 2K | 758.6 | 114.625 | 19.6 |
| 4K × 4K | 6780 | 898.23 | 133.68 |
The column on the left indicates the size of each matrix; the second column gives the result for MATLAB and the third the result for the Intel Math Kernel Library (MKL) LAPACK implementation. The column on the right gives the results of Lahabar's work, which uses a GPGPU to perform the decomposition.
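The MKL column corresponds to a standard LAPACK SVD routine. A minimal sketch of such a call through MKL's LAPACKE interface (matrix size and contents are illustrative; this is not the paper's benchmark code):

```c
#include <mkl_lapacke.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    const int n = 256;                      /* one of the matrix sizes from the table */
    double *a  = malloc(n * n * sizeof *a);
    double *s  = malloc(n * sizeof *s);     /* singular values */
    double *u  = malloc(n * n * sizeof *u);
    double *vt = malloc(n * n * sizeof *vt);
    double *superb = malloc((n - 1) * sizeof *superb);

    for (int i = 0; i < n * n; ++i) a[i] = rand() / (double)RAND_MAX;

    /* Full SVD: A = U * diag(S) * V^T */
    lapack_int info = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'A', 'A', n, n,
                                     a, n, s, u, n, vt, n, superb);
    printf("info = %d, largest singular value = %f\n", (int)info, s[0]);
    return 0;
}
```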
Fig. 2 Data structure. (a) How the data are structured in the source file. (b) The data structure into which they are transformed to maximize data locality.
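Although the exact layouts in Fig. 2 are not recoverable here, a common reorganization for this kind of processing gathers each voxel's time series into contiguous memory, so that the thread (or loop iteration) handling one voxel reads a contiguous block. A hedged sketch of such a transform (the time-major source layout is an assumption, not taken from the paper):

```c
#include <stddef.h>

/* Reorganize a 4-D perfusion dataset so the time samples of each voxel
 * become the fastest-varying dimension.  Assumed source layout:
 * src[t][voxel] (one whole volume per time point, as images are commonly
 * stored); destination layout: dst[voxel][t]. */
void reorganize(const float *src, float *dst, int n_voxels, int n_time) {
    for (int t = 0; t < n_time; ++t)
        for (int v = 0; v < n_voxels; ++v)
            dst[(size_t)v * n_time + t] = src[(size_t)t * n_voxels + v];
}
```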
Fig. 3 GPGPU parallelization workflow.
Fig. 4 GPU kernel program fragment.
Performance of each step.
| Step | Serial running time (s) | Parallel running time (s) | Speedup factor |
|---|---|---|---|
| Brain data load | 0.10 | 0.10 | – |
| Data copying (CPU to GPU) | Not applicable | 0.17 | – |
| Data reorganization | 1.1 | 0.01 | 110 |
| Reorganization and denoising | 4.3 | 0.01 | 430 |
| Deconvolution | 2108 | 564 | 3.74 |
| Data copying (GPU to CPU) | Not applicable | 0.01 | – |
| Draw parametric maps | 0.20 | 0.20 | – |
| Overall | 2114 | 564 | 3.75 |
This table shows the processing time of the serial and parallel algorithms for each individual step. Because the brain data load and draw parametric maps steps are identical in the serial and parallel algorithms, and the data copying steps occur only in the parallel algorithm, speedup factors are not calculated for these steps.
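Deconvolution dominates both columns, so the overall speedup is essentially the deconvolution speedup (an Amdahl's-law effect); using the table's own figures:

$$\text{speedup}_{\text{overall}} = \frac{2114}{564} \approx 3.75 \quad\text{vs.}\quad \text{speedup}_{\text{deconv}} = \frac{2108}{564} \approx 3.74$$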
Overall performance.
| Data size | Serial running time (s) | GPGPU running time (s) | OpenMP running time (s) | MPI running time (s) |
|---|---|---|---|---|
| 128 × 128 × 22 × 80 (MR image size) | 2114 | 564 (speedup factor = 3.75) | 956 (speedup factor = 2.21) | 619 (speedup factor = 3.42) |
| 128 × 128 × 11 × 44 (CT image size) | 360 | 65 (speedup factor = 5.56) | 159 (speedup factor = 2.26) | 94 (speedup factor = 3.84) |
This table indicates the overall running time and speedup factor for all of the serial and parallel implementations.
Fig. 5 Threads per block. This figure shows the relationship between the Threads Per Block parameter and processing time. Note that the X-axis is on a logarithmic (base 2) scale.
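Threads per block is a kernel launch parameter, so sweeping it as in Fig. 5 only requires changing the launch configuration. A minimal sketch of such a sweep (the kernel body is illustrative; the timing uses standard CUDA events):

```c
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;
}

int main(void) {
    const int n = 1 << 22;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    /* Sweep threads-per-block over powers of two, as on Fig. 5's x-axis. */
    for (int tpb = 32; tpb <= 1024; tpb *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        work<<<(n + tpb - 1) / tpb, tpb>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        printf("threads per block = %4d: %.3f ms\n", tpb, ms);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(d);
    return 0;
}
```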
[Algorithm listing, serial implementation: most of the body did not survive extraction. The recoverable final steps produce the three parametric maps:]
CBF colored map ← CBF(1:Size)
CBV colored map ← CBV(1:Size)
MTT colored map ← MTT(1:Size)
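For context, the standard DSC perfusion definitions behind these three maps (textbook relations, not quoted from this paper): the tissue curve is modeled as $C(t) = \mathrm{CBF}\,(\mathrm{AIF} \otimes R)(t)$, and SVD-based deconvolution of the AIF from $C(t)$ recovers $k(t) = \mathrm{CBF}\cdot R(t)$, from which

$$\mathrm{CBF} = \max_t k(t), \qquad \mathrm{CBV} = \frac{\int C(t)\,dt}{\int \mathrm{AIF}(t)\,dt}, \qquad \mathrm{MTT} = \frac{\mathrm{CBV}}{\mathrm{CBF}}.$$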
[Algorithm listing, parallel GPGPU implementation: most of the body did not survive extraction. The recoverable fragments show two GPU parallel regions ("GPU: Parallel do, shared(…)" and "GPU: Parallel do, …"), followed by the same final steps that produce the CBF, CBV and MTT colored maps.]
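The surviving "GPU: Parallel do" directives suggest a one-thread-per-voxel decomposition. A hedged CUDA sketch of that structure (process_voxel is a hypothetical stand-in; the paper's actual kernel deconvolves each voxel's time series by SVD):

```c
#include <cuda_runtime.h>

/* Hypothetical stand-in for the per-voxel deconvolution: it reduces the
 * time series to a peak and an area, echoing the standard estimators
 * (CBF from the curve peak, CBV from the area, MTT = CBV/CBF), but it
 * is NOT the paper's SVD-based computation. */
__device__ void process_voxel(const float *series, int n_time,
                              float *cbf, float *cbv, float *mtt) {
    float peak = 0.0f, area = 0.0f;
    for (int t = 0; t < n_time; ++t) {
        if (series[t] > peak) peak = series[t];
        area += series[t];
    }
    *cbf = peak;
    *cbv = area;
    *mtt = (peak > 0.0f) ? area / peak : 0.0f;
}

/* One thread per voxel, mirroring the listing's "GPU: Parallel do". */
__global__ void perfusion_kernel(const float *data, int n_voxels, int n_time,
                                 float *cbf, float *cbv, float *mtt) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_voxels) return;
    /* Each voxel's time series is contiguous after the Fig. 2 reorganization. */
    process_voxel(&data[(size_t)v * n_time], n_time, &cbf[v], &cbv[v], &mtt[v]);
}
```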