Literature DB >> 22076279

High-performance image reconstruction in fluorescence tomography on desktop computers and graphics hardware.

Manuel Freiberger, Herbert Egger, Manfred Liebmann, Hermann Scharfetter.

Abstract

Image reconstruction in fluorescence optical tomography is a three-dimensional nonlinear ill-posed problem governed by a system of partial differential equations. In this paper we demonstrate that a combination of state of the art numerical algorithms and a careful hardware optimized implementation allows to solve this large-scale inverse problem in a few seconds on standard desktop PCs with modern graphics hardware. In particular, we present methods to solve not only the forward but also the non-linear inverse problem by massively parallel programming on graphics processors. A comparison of optimized CPU and GPU implementations shows that the reconstruction can be accelerated by factors of about 15 through the use of the graphics hardware without compromising the accuracy in the reconstructed images.

Entities: Chemical Disease Gene Species

Keywords: (100.3010) Image reconstruction techniques; (100.3190) Inverse problems; (170.6960) Tomography; (170.7050) Turbid media; (300.6280) Spectroscopy, fluorescence and luminescence

Year: 2011 PMID： 22076279 PMCID： PMC3207387 DOI： 10.1364/BOE.2.003207

Source DB: PubMed Journal: Biomed Opt Express ISSN： 2156-7085 Impact factor: 3.732

1. Introduction

In this paper, we study an indirect measurement technology, namely fluorescence optical tomography, which relies on the ability of fluorescent dyes (fluorophores) to absorb and re-emit light at different wavelengths. The aim of fluorescence tomography is to obtain the distribution of a fluorophore inside an object (an only indirectly accessible quantity) from optical measurements at the boundary (an easily obtainable information). The strength of this imaging modality is the availability of a large number of fluorophores which change their optical properties in dependence of its biological surrounding. Thus, they reflect physiologically interesting parameters like the pH value [1, 2] or tissue oxygenation [3]. Unfortunately, photons are massively scattered in biological tissues. Therefore, fluorescence tomography requires the accurate simulation of light propagation in highly scattering media and the solution of ill-posed nonlinear inverse problems, which are computationally intensive tasks. Fast and reliable numerical algorithms will crucially determine the success of this new imaging modality in biomedical sciences as well as in engineering applications. A recent trend is the application of graphics processing units (GPUs) in scientific computations. In the field of optical tomographic imaging, Zhang et al [4] have reported GPU accelerated bioluminescence reconstructions and make use the proprietary package CULA by EM Photonics for the inversion of the finite element system matrix. Prakash et al [5] also use CULA from within the Matlab-framework NIRFAST to speed up the reconstruction algorithm for diffuse optical tomography. In their work, six distinct tasks are GPU-accelerated such as the solution of a linear equation system and matrix-matrix multiplications. The previously mentioned works use the GPU as co-processor mainly, i.e. data is transferred to the graphics card where certain operations are performed in parallel. The result is copied back to the CPU then. This submission is different in the sense that we aim to perform nearly all computations on the graphics hardware without the need to transfer intermediate data between the CPU and the graphics card. Thus, the GPU is not only used for trivial multiplication tasks but also for the assembly of the forward and adjoint system matrices and the sensitivity matrix, for example. This reduces the comparatively slow data transfer between the computer and the graphics hardware to a minimum and results in much faster image reconstruction. Furthermore, all algorithms are iterative which circumvents the need for matrix inversion algorithms. Our implementation is based on the open-source library AGILE [6] which is available at http://www.imt.tugraz.at/research/agile-gpu-image-reconstruction-library/. The outline of this paper is as follows: Section 2 provides background material about fluorescence tomography and describes the iterative regularization method that is used for the stable solution of the inverse problem. Section 3 deals with the discretization of the problem by finite element methods and gives details for an efficient implementation on graphics hardware. In Section 4, we define a test problem and derive estimates for the complexity and required memory of the reconstruction algorithm. The theoretical results are confirmed by numerical tests, in which we compare the performance of our algorithms on CPU and GPU hardware. Our tests illustrate that the reconstruction times can be reduced significantly (by a factor of about 15) through the use of the graphics hardware.

2. Methods

2.1. Mathematical model

The propagation of light in highly scattering media, e.g. in biological tissue, is typically modeled by the diffusion approximation [7], which serves as a first order approximation to the radiative transport equation. Since light of two different wavelengths is employed in fluorescence tomography, the mathematical model consists of a coupled system of partial differential equations (PDE) [8], namely Here, φ, i = x, m are the photon densities for the excitation (x) and the emission (m) light in the domain Ω covered by tissue, and the source term q models a collimated light source. The parameters κ and μ are the diffusion and absorption coefficients of the tissue, and γ is a measure incorporating the fluorophore’s extinction coefficient and its quantum yield. We assume that these parameters depend in a known form on the concentration c of fluorophore in the tissue, and we write κ(c), μ(c) or γ(c) if this dependence needs to be emphasized. Homogeneous Robin boundary conditions [7] are used to model that no light enters the domain from the outside apart from the light injected through the internal source q. The simplest type of boundary measurement is the photon flux at the emission wavelength which leaves the domain which is modeled by where the function d allows to model the aperture and additional (linear) transfer characteristics of the detector. Other types of measurement functions are possible as well and have been used in e.g. [9-11]. The tomographic measurement process finally consists of illuminating the object with a sequence of light sources q, j = 1,...,ns, and taking measurements of the emission light with an array of detectors d, i = 1,...,nd at the boundary. This gives rise to a measurement matrix ℳ of dimension nd × ns, which obviously depends on the distribution c of the fluorophore, and we write ℳ(c) to express this dependence.

2.2. Image reconstruction

The aim of fluorescence tomography is to determine the inaccessible fluorophore distribution c which gave rise to the measurements ℳ—which are perturbed by measurement noise and model errors—of the emitted light. A Tikhonov-type regularization term is incorporated in the objective functional for stability reasons. Then the reconstruction problem reads To solve this problem in a fast and stable manner, we utilize the iteratively regularized Gauß-Newton method (IRGN) [12], which leads to the iterative algorithm Here, denotes the sensitivity operator (i.e. the Jacobian), which measures the effect of a change in the fluorophore distribution c onto the output ℳ(c) and is its adjoint. The regularization parameter α stabilizes the solution and is usually decreased in every iteration. For the sake of completeness, we would like to mention other possibilities for the regularizing of the Newton iteration, e.g. the truncated Newton-CG algorithm [13], the Levenberg-Marquardt method [14], or the Newton-Landweber method [15]. Several of these approaches have been demonstrated to work well for optical tomography applications [8, 16–19]. Non-quadratic penalty terms have been studied e.g. in [20, 21].

3. Implementation

In the following, we discuss the the main steps of the reconstruction algorithm in more detail and provide information for an efficient implementation. In order to quantify the expected computational workload, we also derive complexity estimates for the individual steps and provide upper bounds for the memory requirements.

3.1. Discretization

For the discretization of the governing partial differential equations, we consider a standard finite element method with piecewise linear basis functions for the photon fields. In our computations, we use an unstructured mesh with N vertices consisting of N tetrahedral elements. For computational efficiency but also simplicity, the optical parameters and the fluorophore concentration c are discretized by piecewise constant functions over the same mesh. The discretized version of the system (1)–(2) reads In this linear system, K, M and R, i ∈ {x, m}, denote the stiffness-, mass- and boundary mass-matrices at the respective wavelength. Each column of the N × ns matrix Q corresponds to a different excitation q, j = 1,...,ns, and the ns columns of the solution matrices V and V store the vector representations of the corresponding photon density fields φ and φ. The measurement data are given by ℳ(c) = D⊤ V, where each of the nd columns of the detector matrix D corresponds to one of the detectors d, i = 1,...,nd, in (4). The discrete sensitivity S = S(c) is an nd × ns × N tensor (respectively an (nd · ns) × N matrix). Its action onto a parameter perturbation h ∈ ℝ is given by where h is an arbitrary element from the N dimensional parameter space, and W, W denote the solutions of the adjoint system The fully discrete equivalent of the iteratively regularized Gauß-Newton method (6) has the form listed in Algorithm 1.

Algorithm 1

Iteratively regularized Gauß-Newton algorithm

1:	initialize
2:	c ← 0
3:	fork = 1 to maxit do
4:	[A_x, A_m, G] ← assembleSystemMatrices(c_k)
5:	V_x ← PBCG(A_x, Q)
6:	V_m ← PBCG(A_m, GV_x)
7:	M ← D^⊤V_m
8:	W_m ← PBCG(A_m, D)
9:	W_x ← PBCG(A_x, GW_m)
10:	S_k ← assembleSensitivity(c_k, V_x, V_m, W_x, W_m)
11:	dc←(Sk⊤Sk+αkI)−1Sk⊤(Mδ−M)
12:	c ← c + dc
13:	α_k₊₁ ← qα_k
14:	end for

Before going into details of the implementation and giving complexity estimates for the individual step as well as upper bounds for the memory requirements, let us make some general remarks: The presented algorithms are independent of the architecture used for the actual computations, i.e. apart from some implementation details, the same algorithms are used for CPU and GPU implementations. In our numerical tests, the high-level algorithms are always controlled by the CPU, while the actual computations (e.g. matrix-vector multiplications) are executed on the CPU or GPU, respectively, inside a few kernels. Only these few kernels have to be adapted to the specific hardware. The advantage of such a “minimally invasive” framework is that algorithms can be migrated gradually from one hardware to another while at the same time the high-level abstraction of the algorithms can be preserved. Such an approach seems very natural, and has been employed previously, e.g. for the simulation of fluid dynamics on CPU-GPU clusters [22]. All compute intensive steps of the reconstruction algorithm have a high degree of locality: A major part of the operations, e.g. the assembly of the system matrices or the sensitivity, can be performed in an element-by-element fashion. This means that similar computations can executed for individual elements independently from each other. Global operations, e.g. the solution of the linear systems, rely on sparse matrix algebra. Note that each row (or column) of the sparse matrices corresponds to a vertex of the mesh, and the non-zero entries of this row stem from degrees of freedom (vertices) belonging to the patch of elements surrounding this vertex. Therefore, also sparse-matrix operations share a high degree of locality, and consequently, the overall reconstruction algorithm is very well-suited for parallelization. Iteratively regularized Gauß-Newton algorithm

3.2. Initialization

The first task is to set up a multilevel mesh hierarchy: The vertex coordinates of the finest level are loaded, and the transfer operators between adjacent levels l and l + 1 are stored as sparse prolongation matrices . This allows to lift a function from level l to the next finer level l + 1, i.e. denotes the linear prolongation of a piecewise linear finite element function u to the finer mesh where it is represented by u+1. The mesh hierarchy is only required for the initialization step. The memory requirement for storing the prolongation matrices is O(N). The element vertex numbers are stored in an N × 4 table containing the global vertex numbers for the N individual elements. The 4-element array g then contains the numbers of the vertices spanning the tetrahedron T. During the setup step, also the vertex-to-element connectivity is required, which is simply given by . The memory requirement for these look-up tables is O(N + N). One additional look-up table is needed for the efficient assembly of the sparse system matrix. This matrix is stored in compressed row-storage (CRS) format. Its sparsity pattern can be created from the mesh hierarchy and the mapping (local index → global index → CRS data index) can be readily computed. We store the inverse of this mapping, which is a one-to-many table and requires O(N) memory. Finally, we pre-compute the mass-, boundary-mass- and stiffness-matrices, M, R and K, for unit coefficients: Each expression results in N matrices of size 4 × 4, which are copied to the GPU memory.

3.3. Sparse matrix format

The solution of linear equation systems heavily relies on fast sparse matrix-vector products. For comparison of matrix-vector kernel performance of different sparse storage layouts we refer to [23]. The results presented in this paper are based on a blocked interleaved compressed row-storage (CRS) format sketched in Fig. 1. In contrast to the conventional CRS layout, the adapted storage scheme stores the elements of the rows interleaved which makes access from different threads more efficient. As not all of the rows have the same number of non-zero entries, the shorter rows have to be padded with dummy elements. In order not to waste too much memory, this padding is done in blocks of 32 rows (a block-size of two is used in Fig. 1 to illustrate the technique). Further details about this format can be found in [24].

Fig. 1

(a) Sparse matrix which is to be stored in compressed row-storage (CRS) format. (b) The conventional memory layout on the CPU consists of the linearized data elements together with their column indices and a vector holding the offset of each row in the data vector. The number of elements in a row is given implicitly by the difference in the row offsets. (c) The adapted layout for the GPU stores the rows in an interleaved manner which requires memory padding marked with × and an additional vector holding the number of non-zero entries per row.

3.4. Assembly of the system matrices

When assembling of the system matrix, the individual element contributions have to be summed. In a straightforward implementation this would result in non-local random write-access on the output matrix which is costly on graphics hardware. However, due to the piecewise constant discretization of the coefficients κ, μ, ρ and c, the contribution of an element T to the excitation and emission system is given as Therefore, we split the assembly into two steps: First, the pre-computed element matrices K, M and R are scaled with the element’s optical parameters and the output is stored in a vector H with length (4 · 4 · N). In the second step, we sum for every position in the CRS matrix the corresponding elements from H. In principle, this summation can be done by multiplying a suitable sparse matrix with H. However, as all non-zero elements in this matrix are 1, we implemented a specialized GPU kernel for this task, which simply sums the elements whose indices are stored in a look-up table. By using such a two-step scheme, we avoid non-local write-access and need non-local read-access only for the second summation step, which is implemented using the GPU’s texture cache. Figure 2 illustrates this procedure. As the construction of the local element matrices does only require local memory access, it can be done fully in parallel which provides almost optimal performance on both, CPU and GPU processors. The total complexity of this step is O(N).

Fig. 2

Assembly of the system matrix: The 4 × 4 stiffness-, mass- and boundary mass-matrices K, M, and R of every element are scaled with the optical parameters κ, μ and ρ, summed and stored temporarily into a vector T. In a second step, these elements are compressed into a sparse matrix A by summing the elements specified by the vertex-to-element connectivity which is given as look-up table (LUT).

3.5. Solution of the linear systems

For the solution of the linear systems A = Q and A = G(c)V for the forward problem, respectively, A = D and A = G(c)W for the adjoint problem, we employ a preconditioned blocked conjugate gradient method (PBCG): instead of solving for one right hand side (i.e. a column in Q) after another, we solve simultaneously for all right hand sides (blocked). The algorithm uses the PCG framework provided by the AGILE library. Thus, only a new kernel had to be coded for the blocked matrix-vector-product. To accelerate the convergence, a single V -cycle of a geometric multigrid preconditioner [25] is used. On the coarsest mesh level, an exact solve is performed by a blocked conjugate gradient method (BCG), which is similar to PBCG but without further preconditioning. A single BCG iteration is also used in the smoothing step. The application of one multigride V -cycle has linear complexity. If the right-hand side has ns columns, one application of the multigrid preconditioner requires O(ns ·N) algebraic operations. Note that all basic operations in the conjugate gradient method and the multigrid preconditioner (e.g. the matrix-vector products or simpler vector operations) can easily be executed on CPUs or GPUs, respectively. Due to the preconditioning, the typical iteration numbers required for convergence of the PBCG algorithm are 10–15. Therefore, the total complexity of the solution of the forward and adjoint problems is O(ns ·N) and O(nd ·N), respectively.

3.6. Assembly of the sensitivity

For the assembly of the sensitivity matrix, we use the representation (9). As h is a piecewise constant function, the mass-M(h) and stiffness-matrices K(h) are just the element matrices M and K for the T -th finite element scaled by the derivatives of the optical parameters. Since the computations for the individual elements are independent from each other, this loop can be executed in parallel. The pseudo-code for the GPU kernel used to assemble the sensitivity matrix is listed in Algorithm 2. The threads are started in blocks of dimension (x, y, z) = (4 × 4 × 32). The size of (4 × 4) is due to the size of the element-wise mass- and stiffness-matrices on a tetrahedral mesh. The last dimension is the number of elements which are processed in parallel and has been set to 32 to maximize the number of threads per block.

Algorithm 2

GPU kernel for assembling the sensitivity matrix

1:	n^z ← 32 · blockIdx + z	% compute element index
2:	d^z[x] ← g^{n^z} (x)	% cache connectivity
3:	synchronize
4:	k^z(x,y) ← K^{n^z} (x,y)	% cache element matrices
5:	m^z(x,y) ← M^{n^z} (x,y)
6:	r^z(x,y) ← R^{n^z} (x,y)
7:	p ← 0
8:	fork = 0 to ns – 1 do
9:	vxz[x]←Vx(dz[x],k)	% cache forward fields
10:	vmz[x]←Vm(dz[x],k)
11:	forl = 0 to nd – 1 do
12:	wxz[y]←Wx(dz[y],l)	% cache adjoint fields
13:	wmz[y]←Wx(dz[y],l)
14:	synchronize
15:	H^z(x,y) = f(k^z(x,y), m^z(x,y), r^z(x,y), vxz[x], vmz[x], wxz[y], wmz[y])
16:	t^z ← Σ_xyH^z(x, y)	% parallel reduce
17:	S(p,n^z) ← t^z	% store element’s sensitivity
18:	p ← p + 1
19:	end for
20:	end for

For readability, all data structures which exist once for every element being processed carry a superscript z instead of writing them as an array. Thus, n is actually an array with 32 entries, d[x] is a 2D array with size (32 × 4) and k(x,y) is a 3D array with size (32 × 4 × 4), for example. The number of the element which is processed by a thread can be inferred from the index of the thread block as shown in line 1. In line 2, g denotes the element-to-vertex connectivity of the T -th finite element, which is cached into the vector d. The algorithm caches the element’s mass- and stiffness-matrices (lines 4–6) and the forward and ajoint solutions (lines 9–10 and 12–13). Line 15 is an abbreviation for the computation of the sensitivity contribution according to Eq. (9). These contributions are summed in parallel (line 16) and the result is stored in S. Assembling the sensitivity matrix requires O(ns ·nd ·N) operations and the same amount of memory. Hence, this is the most time consuming and memory intensive step in the current algorithm. We would also like to note that the assembly of the sensitivity matrices S can be avoided completely if it is applied on-the-fly when computing the matrix-vector products; cf [19] and the remarks in section 4.6. GPU kernel for assembling the sensitivity matrix

3.7. Solution of the Gauß-Newton system

Formally, the Gauß-Newton matrix is a dense N × N matrix, which would be expensive to compute and almost impossible to store. Since by construction is symmetric and positive definite, the Gauß-Newton system can be solved efficiently by the method of conjugate gradients, which requires only the multiplication with S and . In this manner, the complexity of multiplication with the Gauß-Newton matrix is given by O(ns ·nd ·N) algebraic operations, whereas a multiplication with the pre-multiplied matrix would be O(N). For the iterative solution of the Gauß-Newton systems we utilize hardware optimized dense matrix-vector kernels, i.e. LAPACK-BLAS [26] for the CPU and a custom implementation based on AGILE for the GPU version.

4. Results

4.1. Model problem

For our numerical tests, we considered a mouse phantom based on the Digimouse model [27]. The measurement setup consisted of ns = 24 sources and nd = 24 detectors, which were arranged on three rings around the homogeneous torso. Two spherical fluorophore inclusions with a diameter of 5 mm were placed at the height of the middle optode ring inside the body, and the aim of our test was to reconstruct these inclusions. The setup is depicted in Fig. 3.

Fig. 3

Simulation setup showing the mouse phantom, the excitation sources (cylinders) and the photon-detectors (cones). The cross-section at the height of the central optode ring shows the assumed concentration distribution which consists of two 5 mm spheres with a fluorophore concentration of 10μM. For our simulations we used realistic optical tissue properties which were gathered from literature [28-31], cf Table 1. These parameters were kept constant during simulation and reconstruction. In practice, the estimation of those background parameters would require additional a-priori knowledge or measurements.

Table 1

Optical Parameters Used for the Simulations

	μ′_s mm⁻¹	μ_a,i mm⁻¹	ɛ mm⁻¹ M⁻¹	ρ
excitation	0.275	0.036	8.35 ·10³	0.2
emission	0.235	0.029	2.81 ·10³	0.2

Optical Parameters Used for the Simulations The data used in our numerical tests were generated on a finer mesh with 200835 elements and additionally perturbed by 1% Gaussian noise relative to the largest measurement datum. For the finite element discretization, we used a hierarchy of three nested tetrahedral meshes, which were generated with the open source mesh generator NETGEN [32]. The three meshes consisted of 2720, 21 760, and 174080 elements and 853, 5103, and 34 677 vertices, respectively.

4.2. Choice of parameters

The regularization parameters α and the stopping index maxit in the Gauß-Newton algorithm were chosen as follows: We conducted a series of numerical tests for ten similar phantoms, and determined a set of parameters α that worked reasonably well for all samples. These parameters were then used for the actual reconstruction. For the results presented here, a geometric sequence α = qα0 with α0 = 1 and q = 0.2 was used. The Gauß-Newton iteration was stopped when the regularization parameter reached the pre-defined threshold α = 1 × 10−5, which corresponds to maxit=8 Newton iterations.

4.3. Numerical results

All computations were executed on an AMD Phenom 9950 processor. Both, the CPU code and the GPU host code (i.e. not the code running on the graphics hardware ifself) are single-threaded. The compute intensive operations (e.g. the matrix-vector multiplications and the assembly of the sensitivity matrix) were realized as compute kernels, which were implemented for both, CPU and GPU hardware. For our GPU tests, we used an NVIDIA GTX 480 graphics card and NVIDIA’s CUDA toolkit [33] for programming. A typical result of the image reconstruction is shown in Fig. 4.

Fig. 4

Reconstruction of the two fluorescent inclusions shown in Fig. 3 performed with graphics hardware acceleration. The cross-section is again taken at the height of the middle optode ring. 1% noise relative to the largest measurement datum was added for to the reconstruction.

4.4. Performance analysis

In Table 2, we provide a detailed performance analysis of the reconstruction algorithm and compare the computation times of the CPU and GPU implementations. For both architectures we also compared the reconstruction with and without multigrid preconditioning. For all relevant tasks, the execution on the GPU results in a significant speed-up. An exception is the assembly of the system matrices on the coarsest level, where the number of elements is so small that the acceleration achieved on the graphics card is spoiled by the latency for starting the kernels on the GPU. However, the time spent for this part of the computation is negligible. The observed speed-up by factors of about 10 is in good agreement to the accelerations of sparse matrix-vector products [34]. (Note that the factor 10 is mainly due to the approximately 10 times larger memory bandwidth provided by GPUs).

Table 2

Comparison of Single-Threaded CPU and Parallelized GPU Reconstruction Times for the Digimouse Phantom with 174080 Elements

Task	CPU	GPU	Speed-up

Assembly of the system matrices
Compute element contributions
Mesh level 1	n.a.¹	0.04 ms
Mesh level 2	n.a.¹	0.10 ms
Mesh level 3	n.a.¹	0.66 ms
Convert to CRS format
Mesh level 1	n.a.¹	0.72 ms
Mesh level 2	n.a.¹	2.06 ms
Mesh level 3	n.a.¹	13.17 ms
Total
Mesh level 1	0.63 ms	0.76 ms	0.83
Mesh level 2	6.37 ms	2.16 ms	2.95
Mesh level 3	121.97 ms	13.83 ms	8.82

Solution of the linear systems
PBCG without multigrid
Forward solution (V_x, V_m)	10.12 s	736.64 ms	13.73
Adjoint solution (W_x, W_m)	8.84 s	633.92 ms	13.94
PBCG with multigrid
Forward solution (V_x, V_m)	5.45 s	461.60 ms	11.81
Adjoint solution (W_x, W_m)	6.02 s	508.29 ms	11.85

Computation of measurements
D^⊤V_m	496.40 ms	6.53 ms	76.01

Assembly of the sensitivity matrix
assembleSensitivity	16.68 s	1.03 s	16.23

Solution of the Gauß-Newton system (Sk*Sk+αkI)−1	10.87 s	331.92 ms	32.75

Total reconstruction time
Without multigrid	6.39 min	27.39 s	14.00
With multigrid	5.41 min	24.94 s	13.01

On the CPU the system matrices are assembled in one step

Comparison of Single-Threaded CPU and Parallelized GPU Reconstruction Times for the Digimouse Phantom with 174080 Elements On the CPU the system matrices are assembled in one step

4.5. Accuracy

Only single-precision arithmetic executes very fast on current commodity graphics hardware, i.e. use of double-precision operations results in increased memory requirements and a tremendous performance drop. In order to obtain insight into the errors introduced by using single-precision operations, we conducted the tests in single-precision on CPU and GPU hardware, and also in double-precision on the CPU. First, the outcome of the forward operator (i.e. the simulated measurement data) is compared for the true fluorescence distribution on the mesh with with 200835 elements. The errors are related to the largest measurement datum such that the numerical error can be compared to the data noise easily. Given the single- and double-precision CPU and GPU measurement data ℳ, ℳ and ℳ the discrepancies are and As can be seen, the numerical error is far beyond the measurement noise of 1% of the largest measurement datum. For practical applications, the “noise” will be determined mainly by the measurement setup (e.g. inaccuracies in the optode locations), model errors and the detector noise, whereas the additional numerical noise is negligible. As a consequence, the same regularization parameter (which has to be chosen depending on the noise level) can be used for both the CPU and GPU implementation. To give evidence for the similarity of the resultant images, the relative errors based on the L2 norm of the reconstructed concentrations are compared. Denoting the reconstructed fluorophore concentration from the single- and double-precision CPU and GPU implementation with c, c and c, these relative errors are The resultant images differ by less than 1% which we consider to be sufficient for practical applications.

4.6. Memory requirements

In Table 3, we list the memory requirements for the essential data structures on the finest finite element mesh. The additional memory required for the coarse level matrices required for the multigrid preconditioner is negligible.

Table 3

Estimated and True Memory Required (in Mega Bytes; MB) for Selected Data Structures of the Reconstruction Algorithm with Single-Precision Floating-Point Numbers

Data structure	Estimate	MB
Local element matrices K, M, R	3 ·(4 ·4) × N_T	31.88
Assembled matrices A_x, A_m, G	depending on connectivity	15.31
Source vectors Q	ns × N_V	3.17
Detector vectors D	nd × N_V	3.17
Forward solutions V_x, V_m	2 ·ns × N_V	6.35
Adjoint solutions W_x, W_m	2 ·nd × N_V	6.35
Sensitivity S	(nd ·ns) × N_T	382.50

Total		448.73

Estimated and True Memory Required (in Mega Bytes; MB) for Selected Data Structures of the Reconstruction Algorithm with Single-Precision Floating-Point Numbers In our implementation, the sensitivity matrix requires most of the GPU memory. To avoid excessive memory consumption, the product of the sensitivity matrix with a vector can also be computed on-the-fly, without the need to store the whole sensitivity. Also the complexity of the multiplication with the sensitivity (now O(ns ·nd ·N)) can be reduced, i.e. it has been demonstrated in [19] that the application of the sensitivity can be realized in O(ns · N) operations. Such an approach, however, requires the solution of the adjoint problems for every multiplication with the sensitivity matrix. Comparing with the computation times in Table 2, this would drastically slow down the computations in practice, at least for the values of ns, nd, and N, N used in our test study. If the number of sources or detectors becomes very large (e.g. if a camera is used as detector), such an alternative might become favorable.

5. Discussion

In this article, we considered the image reconstruction in fluorescence tomography by iterative regularization methods of Newton-type and a discretization by finite element methods. We demonstrated that a combination of state of the art numerical methods, the utilization of modern hardware, and a hardware optimized implementation allows to solve this large-scale nonlinear inverse problem within a few seconds on standard desktop PCs, without compromising accuracy. In particular, the extensive use of the GPU’s computing power resulted in a significant speed-up of the reconstruction process by factors of approximately 10–15. Such a degree of acceleration is in good agreement to other work [34, 35]. The use of unstructured meshes for our finite element simulations provides high flexibility with respect to geometry and facilitates the mesh generation process. A further acceleration of the computations could be achieved, however, by utilizing structured grids, which lead to structured sparsity patterns of the system matrices and allow faster access of global memory, cf [36]. In contrast to previous work [4, 5], we have presented algorithms which are fully iterative in nature and do not require matrix inversion libraries such as CULA, for example. Additionally, we ported the whole reconstruction algorithm to the graphics card and use the CPU only as controller which avoids expensive memory transfer between the host and the GPU. The pre-computation of the mass- and stiffness-matrices of the individual elements and the creation of prolongation and restriction matrices for the mesh hierarchy are the only tasks which are still executed on the CPU. However, these setup steps have to be computed only once and could in principle also be stored along with the mesh data. To achieve efficient code on graphics hardware, care has to be taken when accessing global memory. Memory access is most efficient, if a sequence of threads reads or writes to a sequential memory area (coalesced memory access) [33]. Therefore, the photon field matrices V, V and W, W, but also the associated source and detector matrices Q and D are stored in row-major order. As one column of these matrices belongs to one photon field, this means that the elements of a given photon field are interleaved. In combination with the blocked conjugate gradient method, internal performance measurements have shown that this memory layout provides speed-ups by a factor of roughly 2–3 on CPUs and up to 10 on GPUs. The lack of fast double-precision floating point operations on current graphics hardware limits the accuracy of the reconstructions to some extent. In our numerical examples, the relative errors in the forward simulation due to the single-precision arithmetic was less then 1% for both the CPU and GPU reconstructions. Thus, the numerical errors can be expected to be dominated by measurement noise and model errors and single-precision arithmetic may well suffice for this kind of (severely ill-posed) inverse problems. The choice of an appropriate sequence of regularization parameter α crucially determines the performance of the iterative reconstruction algorithm. Following the analysis of the iteratively regularized Gauß-Newton method [12, 15], α should decay exponentially, e.g. α = α0q for some 0 < q < 1. The choice of α0 and q, however, depends on the problem. While in principle, appropriate values can be found by a careful analysis of the problem under investigation, a more practical way to tune these parameters is to perform a sequence of reconstructions for known solutions and select regularization parameter that work well for all cases in this training set. The same set of parameters can then be used for similar problems. The choice of the stopping index maxit is another crucial ingredient for the numerical algorithm. If α0 and q have been fixed, and α = α0q, then this amounts to choosing a minimal regularization parameter α := αmaxit. A rigorous way to choose α (respectively maxit) is provided by Morozov’s discrepancy principle [37, 38], i.e. the reconstruction algorithm is stopped, as soon as ||ℳ(c) – ℳ|| ∼ δ, where δ is a measure for the measurement and model errors. Another rigorous way would be to choose α ∼ δ. Both choices however require the knowledge of the level δ of the perturbations, which is usually not known in practice. A heuristic way to choose α without knowledge of δ is provided by the L-curve method [38,39] or generalized cross-validation [40]. If a series of similar problems is considered, then maxit (or α) can again be obtained by “training” the algorithm on a small data set with known solutions. This strategy is probably the most practical one and also the method we utilize in our experiments. In our opinion, the fast and reliable image reconstruction is a core requirement for the acceptance of new imaging modalities like fluorescence tomography in pre-clinical and also engineering applications. GPU-accelerated algorithms can result in reconstruction times in the order of a few seconds, which allows the inspection of the results on-the-fly and, thus, certainly facilitates the adoption of this technique in a non-research environment.

16 in total

1. Optical diffusion in layered media.

Authors: M Keijzer; W M Star; P R Storchi
Journal: Appl Opt Date: 1988-05-01 Impact factor: 1.980

2. Nonlinear inversion schemes for fluorescence optical tomography.

Authors: Manuel Freiberger; Herbert Egger; Hermann Scharfetter
Journal: IEEE Trans Biomed Eng Date: 2010-06-17 Impact factor: 4.538

3. Functional optical detection based on pH dependent fluorescence lifetime.

Authors: Israel Gannot; Izhar Ron; Farid Hekmat; Victor Chernomordik; Amir Gandjbakhche
Journal: Lasers Surg Med Date: 2004 Impact factor: 4.025

4. Gauss-Newton method for image reconstruction in diffuse optical tomography.

Authors: Martin Schweiger; Simon R Arridge; Ilkka Nissilä
Journal: Phys Med Biol Date: 2005-05-05 Impact factor: 3.609

5. Three-dimensional fluorescence lifetime tomography.

Authors: Anuradha Godavarty; Eva M Sevick-Muraca; Margaret J Eppstein
Journal: Med Phys Date: 2005-04 Impact factor: 4.071

6. Digimouse: a 3D whole body mouse atlas from CT and cryosection data.

Authors: Belma Dogdas; David Stout; Arion F Chatziioannou; Richard M Leahy
Journal: Phys Med Biol Date: 2007-01-10 Impact factor: 3.609

7. Frequency-domain fluorescent diffusion tomography: a finite-element-based algorithm and simulations.

Authors: H Jiang
Journal: Appl Opt Date: 1998-08-01 Impact factor: 1.980

8. Adaptive finite element based tomography for fluorescence optical imaging in tissue.

Authors: Amit Joshi; Wolfgang Bangerth; Eva Sevick-Muraca
Journal: Opt Express Date: 2004-11-01 Impact factor: 3.894

9. Light-absorbing properties, stability, and spectral stabilization of indocyanine green.

Authors: M L Landsman; G Kwant; G A Mook; W G Zijlstra
Journal: J Appl Physiol Date: 1976-04 Impact factor: 3.531

10. Characterization of tumorous and normal tissue using a pH-sensitive fluorescence indicator (5,6-carboxyfluorescein) in vivo.

Authors: S Mordon; V Maunoury; J M Devoisselle; Y Abbas; D Coustaud
Journal: J Photochem Photobiol B Date: 1992-05-15 Impact factor: 6.252

3 in total

1. Towards real-time detection of seizures in awake rats with GPU-accelerated diffuse optical tomography.

Authors: Tao Zhang; Junli Zhou; Paul R Carney; Huabei Jiang
Journal: J Neurosci Methods Date: 2014-11-01 Impact factor: 2.390

2. High-Speed GPU-Based Fully Three-Dimensional Diffuse Optical Tomographic System.

Authors: Manob Jyoti Saikia; Rajan Kanhirodan; Ram Mohan Vasu
Journal: Int J Biomed Imaging Date: 2014-04-10

3. X-ray nanoprobes and diffraction-limited storage rings: opportunities and challenges of fluorescence tomography of biological specimens.

Authors: Martin D de Jonge; Christopher G Ryan; Chris J Jacobsen
Journal: J Synchrotron Radiat Date: 2014-08-27 Impact factor: 2.616

3 in total