| Literature DB >> 22076279 |
Manuel Freiberger, Herbert Egger, Manfred Liebmann, Hermann Scharfetter.
Abstract
Image reconstruction in fluorescence optical tomography is a three-dimensional nonlinear ill-posed problem governed by a system of partial differential equations. In this paper we demonstrate that a combination of state of the art numerical algorithms and a careful hardware optimized implementation allows to solve this large-scale inverse problem in a few seconds on standard desktop PCs with modern graphics hardware. In particular, we present methods to solve not only the forward but also the non-linear inverse problem by massively parallel programming on graphics processors. A comparison of optimized CPU and GPU implementations shows that the reconstruction can be accelerated by factors of about 15 through the use of the graphics hardware without compromising the accuracy in the reconstructed images.Entities:
Keywords: (100.3010) Image reconstruction techniques; (100.3190) Inverse problems; (170.6960) Tomography; (170.7050) Turbid media; (300.6280) Spectroscopy, fluorescence and luminescence
Year: 2011 PMID: 22076279 PMCID: PMC3207387 DOI: 10.1364/BOE.2.003207
Source DB: PubMed Journal: Biomed Opt Express ISSN: 2156-7085 Impact factor: 3.732
Iteratively regularized Gauß-Newton algorithm
| 1: | initialize |
| 2: | |
| 3: | |
| 4: | [ |
| 5: | |
| 6: | |
| 7: | |
| 8: | |
| 9: | |
| 10: | |
| 11: |
|
| 12: | |
| 13: | |
| 14: |
Fig. 1(a) Sparse matrix which is to be stored in compressed row-storage (CRS) format. (b) The conventional memory layout on the CPU consists of the linearized data elements together with their column indices and a vector holding the offset of each row in the data vector. The number of elements in a row is given implicitly by the difference in the row offsets. (c) The adapted layout for the GPU stores the rows in an interleaved manner which requires memory padding marked with × and an additional vector holding the number of non-zero entries per row.
Fig. 2Assembly of the system matrix: The 4 × 4 stiffness-, mass- and boundary mass-matrices K, M, and R of every element are scaled with the optical parameters κ, μ and ρ, summed and stored temporarily into a vector T. In a second step, these elements are compressed into a sparse matrix A by summing the elements specified by the vertex-to-element connectivity which is given as look-up table (LUT).
GPU kernel for assembling the sensitivity matrix
| 1: | % compute element index | |
| 2: | % cache connectivity | |
| 3: | ||
| 4: | % cache element matrices | |
| 5: | ||
| 6: | ||
| 7: | ||
| 8: | ||
| 9: |
| % cache forward fields |
| 10: |
| |
| 11: | | |
| 12: |
| % cache adjoint fields |
| 13: |
| |
| 14: | | |
| 15: | | |
| 16: | | % parallel reduce |
| 17: | | % store element’s sensitivity |
| 18: | | |
| 19: | | |
| 20: | ||
Fig. 3Simulation setup showing the mouse phantom, the excitation sources (cylinders) and the photon-detectors (cones). The cross-section at the height of the central optode ring shows the assumed concentration distribution which consists of two 5 mm spheres with a fluorophore concentration of 10μM.
Optical Parameters Used for the Simulations
| excitation | 0.275 | 0.036 | 8.35 ·103 | 0.2 |
| emission | 0.235 | 0.029 | 2.81 ·103 | 0.2 |
Fig. 4Reconstruction of the two fluorescent inclusions shown in Fig. 3 performed with graphics hardware acceleration. The cross-section is again taken at the height of the middle optode ring. 1% noise relative to the largest measurement datum was added for to the reconstruction.
Comparison of Single-Threaded CPU and Parallelized GPU Reconstruction Times for the Digimouse Phantom with 174080 Elements
| Task | CPU | GPU | Speed-up |
|---|---|---|---|
| Assembly of the system matrices | |||
| Compute element contributions | |||
| Mesh level 1 | n.a. | 0.04 ms | |
| Mesh level 2 | n.a. | 0.10 ms | |
| Mesh level 3 | n.a. | 0.66 ms | |
| Convert to CRS format | |||
| Mesh level 1 | n.a. | 0.72 ms | |
| Mesh level 2 | n.a. | 2.06 ms | |
| Mesh level 3 | n.a. | 13.17 ms | |
| Total | |||
| Mesh level 1 | 0.63 ms | 0.76 ms | 0.83 |
| Mesh level 2 | 6.37 ms | 2.16 ms | 2.95 |
| Mesh level 3 | 121.97 ms | 13.83 ms | 8.82 |
| Solution of the linear systems | |||
| PBCG without multigrid | |||
| Forward solution ( | 10.12 s | 736.64 ms | 13.73 |
| Adjoint solution ( | 8.84 s | 633.92 ms | 13.94 |
| PBCG with multigrid | |||
| Forward solution ( | 5.45 s | 461.60 ms | 11.81 |
| Adjoint solution ( | 6.02 s | 508.29 ms | 11.85 |
| Computation of measurements | |||
| | 496.40 ms | 6.53 ms | 76.01 |
| Assembly of the sensitivity matrix | |||
| assembleSensitivity | 16.68 s | 1.03 s | 16.23 |
| Solution of the Gauß-Newton system
| 10.87 s | 331.92 ms | 32.75 |
| Total reconstruction time | |||
| Without multigrid | 6.39 min | 27.39 s | 14.00 |
| With multigrid | 5.41 min | 24.94 s | 13.01 |
On the CPU the system matrices are assembled in one step
Estimated and True Memory Required (in Mega Bytes; MB) for Selected Data Structures of the Reconstruction Algorithm with Single-Precision Floating-Point Numbers
| Data structure | Estimate | MB |
|---|---|---|
| Local element matrices | 3 ·(4 ·4) × | 31.88 |
| Assembled matrices | depending on connectivity | 15.31 |
| Source vectors | 3.17 | |
| Detector vectors | 3.17 | |
| Forward solutions | 2 · | 6.35 |
| Adjoint solutions | 2 · | 6.35 |
| Sensitivity | ( | 382.50 |
| Total | 448.73 | |