Erik Alerstam, William Chun Yip Lo, Tianyi David Han, Jonathan Rose, Stefan Andersson-Engels, Lothar Lilge.
Abstract
A highly optimized Monte Carlo (MC) code package for simulating light transport is developed on the latest graphics processing unit (GPU) built for general-purpose computing from NVIDIA, the Fermi GPU. In biomedical optics, the MC method is the gold standard approach for simulating light transport in biological tissue, both due to its accuracy and its flexibility in modelling realistic, heterogeneous tissue geometry in 3-D. However, the widespread use of MC simulations in inverse problems, such as treatment planning for photodynamic therapy (PDT), is limited by their long computation time. Although MC simulation is inherently parallel, optimizing MC code on the GPU has been shown to be a challenge, particularly when the sharing of simulation result matrices among many parallel threads demands the frequent use of atomic instructions to access the slow GPU global memory. This paper proposes an optimization scheme that utilizes the fast shared memory to resolve the performance bottleneck caused by atomic access, and discusses numerous other optimization techniques needed to harness the full potential of the GPU. Using these techniques, a widely accepted MC code package in biophotonics, called MCML, was successfully accelerated on a Fermi GPU by approximately 600x compared to a state-of-the-art Intel Core i7 CPU. A skin model consisting of 7 layers was used as the standard simulation geometry. To demonstrate the possibility of GPU cluster computing, the same GPU code was executed on four GPUs, showing a linear improvement in performance with an increasing number of GPUs. The GPU-based MCML code package, named GPU-MCML, is compatible with a wide range of graphics cards and is released as open-source software in two versions: an optimized version tuned for high performance and a simplified version for beginners (http://code.google.com/p/gpumcml).
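The caching idea in the abstract can be illustrated with a purely conceptual model. The sketch below is plain Python, not GPU code and not the GPU-MCML implementation: a small array stands in for the fast shared memory, a full-size array stands in for the slow global memory, and a counter records how often the slow array is touched. The class name `AbsorptionTally`, the cache size, and the assumption that the high-fluence region is the set of voxels with the smallest r and z indices are all hypothetical choices made for illustration.

```python
# Conceptual model only: plain Python standing in for GPU shared/global memory.
NR_CACHED, NZ_CACHED = 4, 8  # hypothetical size of the cached high-fluence region


class AbsorptionTally:
    """Two-level tally: a small fast cache in front of the full A[r][z] array."""

    def __init__(self, nr, nz):
        self.global_A = [[0.0] * nz for _ in range(nr)]             # slow, full-size array
        self.cache = [[0.0] * NZ_CACHED for _ in range(NR_CACHED)]  # fast, small array
        self.global_writes = 0                                      # counts slow-memory updates

    def deposit(self, ir, iz, weight):
        """Record absorbed weight in voxel (ir, iz)."""
        if ir < NR_CACHED and iz < NZ_CACHED:
            self.cache[ir][iz] += weight      # inside the cached region: fast update
        else:
            self.global_A[ir][iz] += weight   # outside it: update slow memory directly
            self.global_writes += 1

    def flush(self):
        """Merge the cache into the full array, one slow update per cached voxel."""
        for ir in range(NR_CACHED):
            for iz in range(NZ_CACHED):
                self.global_A[ir][iz] += self.cache[ir][iz]
                self.cache[ir][iz] = 0.0
                self.global_writes += 1


tally = AbsorptionTally(nr=16, nz=16)
for _ in range(1000):
    tally.deposit(0, 0, 0.5)   # many deposits inside the cached region
tally.deposit(10, 12, 0.25)    # one deposit outside the cached region
tally.flush()
print(tally.global_A[0][0], tally.global_writes)
```

In this toy run, 1001 deposits cause far fewer updates of the slow array than deposits, because the repeated deposits into one cached voxel are merged and written out once at flush time. On a real GPU the updates would additionally need to be atomic, which this single-threaded model does not capture.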
Keywords: (170.3660) Light propagation in tissues; (170.5280) Photon migration; (170.7050) Turbid media; (290.4210) Multiple scattering
Year: 2010 PMID: 21258498 PMCID: PMC3018007 DOI: 10.1364/BOE.1.000658
Source DB: PubMed Journal: Biomed Opt Express ISSN: 2156-7085 Impact factor: 3.732
Fig. 1. Left: Flow-chart of the MCML algorithm. Right: Simplified representation used in subsequent sections.
Fig. 3. Parallelization scheme of the GPU-accelerated MCML code. Note that the number of thread blocks Q is matched to the number of SMs available and the number of threads P in each block is a many-to-one mapping (in this case, Q=15 and P=896 for GTX 480).
Fig. 4. Dependence of the simulation time on the shared memory size used for caching the high-fluence region of the A[r][z] array. The speedup is relative to the CPU-MCML execution time of 14418 s (~4 h).
Fig. 5. Dependence of the simulation time on the shape of the region of the A[r][z] array that is cached in shared memory. Here, NR and NZ are the numbers of voxels in the r and z dimensions of the cached region, respectively. The speedup is relative to the CPU-MCML execution time of 14418 s (~4 h).
Effect of absorption grid resolution (dr, dz) and grid dimensions (nr, nz) on simulation time, simulating 10^7 photon packets in the thick homogeneous slab model. The GPU-MCML results were measured on the GTX 480 GPU.
| dr, dz | nr | nz | GPU-MCML (s) | CPU-MCML (s) | Speedup |
|---|---|---|---|---|---|
| 1 | 1 | 1 | 23.5 | 21817 | 928x |
| 0.1 | 10 | 10 | 23.2 | 22708 | 979x |
| 0.01 | 100 | 100 | 26.0 | 22624 | 870x |
| 0.001 | 1000 | 1000 | 31.5 | 22813 | 724x |
Speedup as a function of the number of GPUs for simulating 10^8 photon packets in a skin model (λ = 600 nm). The speedup is relative to the CPU-MCML execution time of 14418 s (~4 h). Values in brackets were generated without tracking absorption; only reflectance and transmittance were recorded.
| Number of GPUs | Platform Configuration | Time (s) | Speedup |
|---|---|---|---|
| 1 | GTX 295 (using 1 of 2 GPUs) | 73.3 (45.9) | 197x (314x) |
| 2 | GTX 295 (using both GPUs) | 37.2 (23.4) | 388x (616x) |
| 3 | GTX 295 (2 GPUs) + GTX 280 | 24.8 (15.7) | 581x (918x) |
| 4 | GTX 295 (2 GPUs) + 2 x GTX 280 | 18.6 (11.9) | 775x (1212x) |
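The speedup column is the CPU-MCML reference time divided by the GPU time, so the table can be checked with a few lines of arithmetic. The sketch below also computes a rough multi-GPU scaling efficiency (time on one GPU divided by n times the time on n GPUs). That efficiency figure is an illustrative addition rather than a quantity reported in the source, and it treats the GTX 280 and a single GTX 295 GPU as equivalent, which is a simplification.

```python
CPU_TIME_S = 14418.0  # CPU-MCML reference time quoted in the caption

# Number of GPUs -> GPU-MCML time in seconds (absorption tracked), from the table
gpu_times = {1: 73.3, 2: 37.2, 3: 24.8, 4: 18.6}


def speedup(gpu_time_s):
    """Speedup relative to the CPU-MCML reference run."""
    return CPU_TIME_S / gpu_time_s


def scaling_efficiency(n_gpus):
    """Measured gain over one GPU divided by the ideal gain of n_gpus."""
    return gpu_times[1] / (n_gpus * gpu_times[n_gpus])


for n, t in gpu_times.items():
    print(f"{n} GPU(s): speedup {speedup(t):.0f}x, "
          f"efficiency {scaling_efficiency(n):.1%}")
```

An efficiency close to 100% for every row is what "linear improvement in performance with an increasing number of GPUs" means in practice.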
Execution time of the simplified version of GPU-MCML for 10^8 photon packets. The speedup is relative to the CPU-MCML execution time of 14418 s (~4 h). Values in brackets were generated without tracking absorption; only reflectance and transmittance were recorded.
| Platform | Simplified GPU-MCML (s) | Speedup |
|---|---|---|
| GeForce GTX 280 | 225.8 (38.3) | 64x (376x) |
| GeForce GTX 480 | 147.9 (17.5) | 97x (824x) |
Tissue optical properties of a seven-layer skin model (λ = 600 nm)
| Layer | n | μa (cm^-1) | μs (cm^-1) | g | Thickness (cm) |
|---|---|---|---|---|---|
| 1. stratum corneum | 1.53 | 0.2 | 1000 | 0.9 | 0.002 |
| 2. living epidermis | 1.34 | 0.15 | 400 | 0.85 | 0.008 |
| 3. papillary dermis | 1.4 | 0.7 | 300 | 0.8 | 0.01 |
| 4. upper blood net dermis | 1.39 | 1 | 350 | 0.9 | 0.008 |
| 5. dermis | 1.4 | 0.7 | 200 | 0.76 | 0.162 |
| 6. deep blood net dermis | 1.39 | 1 | 350 | 0.95 | 0.02 |
| 7. subcutaneous fat | 1.44 | 0.3 | 150 | 0.8 | 0.59 |
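As a minimal sketch, the skin model can be written down as data and a few standard derived quantities computed from it. The code assumes the numeric columns are, in order, refractive index n, absorption coefficient μa, scattering coefficient μs, anisotropy factor g, and thickness, with coefficients in cm^-1 and thickness in cm. The `Layer` tuple and the helper function names are hypothetical and are not taken from the MCML or GPU-MCML source.

```python
from collections import namedtuple

# mua and mus in 1/cm, thickness in cm (assumed column order: n, mua, mus, g, thickness)
Layer = namedtuple("Layer", "name n mua mus g thickness")

SKIN_600NM = [
    Layer("stratum corneum",        1.53, 0.2,  1000, 0.9,  0.002),
    Layer("living epidermis",       1.34, 0.15, 400,  0.85, 0.008),
    Layer("papillary dermis",       1.4,  0.7,  300,  0.8,  0.01),
    Layer("upper blood net dermis", 1.39, 1,    350,  0.9,  0.008),
    Layer("dermis",                 1.4,  0.7,  200,  0.76, 0.162),
    Layer("deep blood net dermis",  1.39, 1,    350,  0.95, 0.02),
    Layer("subcutaneous fat",       1.44, 0.3,  150,  0.8,  0.59),
]


def total_thickness(layers):
    """Sum of layer thicknesses in cm."""
    return sum(layer.thickness for layer in layers)


def albedo(layer):
    """Fraction of interactions that are scattering events: mus / (mua + mus)."""
    return layer.mus / (layer.mua + layer.mus)


def mean_free_path(layer):
    """Mean distance between interactions in cm: 1 / (mua + mus)."""
    return 1.0 / (layer.mua + layer.mus)


print(f"total thickness: {total_thickness(SKIN_600NM):.3f} cm")
for layer in SKIN_600NM:
    print(f"{layer.name}: albedo {albedo(layer):.4f}, "
          f"mean free path {mean_free_path(layer):.2e} cm")
```

Under this reading of the columns, every layer is strongly scattering-dominated (albedo above 0.99), which means each photon packet undergoes many scattering events before being absorbed.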
Effect of GPU architecture on simulation time for 10^8 photon packets. The speedup is relative to the CPU-MCML execution time of 14418 s (~4 h). Values in brackets were generated without tracking absorption; only reflectance and transmittance were recorded.
| Platform (Compute Capability) | No. of SMs | No. of SPs | GPU-MCML (s) | Speedup |
|---|---|---|---|---|
| GeForce GTX 280 (1.3) | 30 | 240 | 60.3 (38.6) | 239x (374x) |
| GeForce GTX 480 (2.0-Fermi) | 15 | 480 | 23.2 (16.6) | 621x (869x) |
Fig. 6. Simulated fluence distribution and corresponding contour plots in the skin model (10^8 photon packets) for the impulse response, generated by GPU-MCML (left) and by CPU-MCML (right). Note the logarithmic scale. The first layers are thin and cannot be fully appreciated at this scale, especially as their optical properties are rather similar. Both simulations give the same results to within statistical uncertainty.
Fig. 7. Distribution of relative error for the skin model (10^8 photon packets). Left: GPU-MCML vs. CPU-MCML. Right: CPU-MCML vs. CPU-MCML. Color bar represents percent error from 0% to 10%.