Variational regularisation for inverse problems with imperfect forward operators and general noise models.

Leon Bungert^1,2, Martin Burger^1, Yury Korolev^3,4, Carola-Bibiane Schönlieb^3.

Abstract

We study variational regularisation methods for inverse problems with imperfect forward operators whose errors can be modelled by order intervals in a partial order of a Banach lattice. We carry out analysis with respect to existence and convex duality for general data fidelity terms and regularisation functionals. Both for a priori and a posteriori parameter choice rules, we obtain convergence rates of the regularised solutions in terms of Bregman distances. Our results apply to fidelity terms such as Wasserstein distances, φ-divergences, norms, as well as sums and infimal convolutions of those.

Keywords: Banach lattices; Bregman distances; Kullback–Leibler divergence; Wasserstein distances; discrepancy principle; f-divergences; imperfect forward models

Year: 2020 PMID: 34149144 PMCID: PMC8208616 DOI: 10.1088/1361-6420/abc531

Source DB: PubMed Journal: Inverse Probl ISSN: 0266-5611 Impact factor: 2.407

Introduction

We consider linear inverse problems

$$Au = \bar f, \tag{1.1}$$

where $A\colon \mathcal X \to \mathcal Y$ is a linear bounded operator (referred to as the forward operator or the forward model) acting between two Banach spaces $\mathcal X$ and $\mathcal Y$. The exact measurement $\bar f$ is typically not available and only a noisy version of it, $f$, is known along with an estimate of the noise level δ. Since the inversion of (1.1) is often unstable with respect to noise and hence ill-posed, it requires regularisation. Variational regularisation replaces solving (1.1) by the following optimisation problem

$$\min_{u \in \mathcal X} \; \mathcal H(Au \mid f) + \alpha \mathcal J(u), \tag{1.2}$$

where $\mathcal H(\cdot \mid \cdot)$ is a so-called data fidelity function that models statistical properties of the noise in f and $\mathcal J$ is a regularisation functional that stabilises the inversion. The regularisation parameter α > 0 balances the influence of the data fidelity and the regularisation. The amount of noise δ in the measurement f is assumed to be such that $\mathcal H(\bar f \mid f) \le \delta$.

The fidelity function often depends only on the difference of the arguments, i.e. $\mathcal H(v \mid f) = h(v - f)$ for some function h. The most common example is a squared norm of the difference $v - f$. There are, however, cases when the fidelity function depends on its arguments in a more complicated manner; an example is the Kullback–Leibler divergence that is used to model Poisson noise [1] (see also the review paper [2]). Problems with general fidelity functions were analysed in [3, 4].

To guarantee convergence of the minimisers of (1.2) to a solution of (1.1) as the noise level δ decreases, the regularisation parameter α needs to be chosen as a function of the measurement noise, α = α(δ) (a priori parameter choices), or of the measurement itself and of the measurement noise, α = α(f, δ) (a posteriori parameter choices). For a priori parameter choice rules, convergence rates for solutions of (1.2) in different scenarios have been obtained, e.g., in [5-9]. A classical a posteriori parameter choice rule is the so-called discrepancy principle, originally introduced in [10] and later studied in, e.g., [11-13].
Roughly speaking, it consists in choosing α = α(f, δ) such that the following equation is satisfied

$$\mathcal H(Au_\alpha \mid f) = \delta, \tag{1.3}$$

where $u_\alpha$ is the solution of (1.2) corresponding to the regularisation parameter α.

In many applications, not only is the measurement f noisy, but the forward operator A that generated the data is also not precisely known. Errors in the operator may come from the uncertainty in some model-related parameters such as the point-spread function of a microscope, simplified model geometry and/or discretisation. A classical approach to modelling errors in the forward operator assumes an error estimate in the operator norm, i.e.

$$\|A - A_h\| \le h, \tag{1.4}$$

where $A_h\colon \mathcal X \to \mathcal Y$ is a linear bounded operator that we have numerical access to and h ⩾ 0 describes the approximation error (e.g., [14-17]). To guarantee convergence in this setting, the parameter α needs to be chosen as a function of δ and h (a priori choice rules) or of δ, h, f and $A_h$ (a posteriori choice rules). Generalisations of the discrepancy principle to this setting are available [18-20], but they usually rely on a triangle inequality that $\mathcal H$ needs to satisfy.

An alternative approach to modelling operator errors using order intervals in Banach lattices was proposed in [21-23]. It assumes that the spaces $\mathcal X$ and $\mathcal Y$ have a lattice structure [24] and that, instead of (1.4), lower and upper bounds for the operator are available,

$$A^l \le A \le A^u, \tag{1.5}$$

where the inequalities are understood in the sense of a partial order for linear operators, i.e.

$$A^l u \le Au \le A^u u \quad \text{for all } u \ge 0. \tag{1.6}$$

The inequalities in (1.6) are understood in the abstract sense of a Banach lattice, which for $L^p$ spaces means inequality almost everywhere. In order for the partial order bounds (1.5) to be well-defined, we assume that $A$ is a regular operator [24], i.e. that it can be written as a difference of two positive operators, $A = A_1 - A_2$, where for any u ⩾ 0 it holds that $A_{1,2}u \ge 0$. Some examples of regular operators will be given later.
The approach (1.6) to describing errors in the forward operator was studied in the context of the residual method in the case when the data fidelity is a characteristic function of a norm ballIn this case, one solves the following problemwhere and are pointwise (a.e.) lower and upper bounds for the exact data in (1.1) such that and is the constant one-function. For comparison, with the data term (1.7) and without an operator error, (1.2) translates intowhere the constraint is equivalent to ‖Au − f‖∞ ⩽ δ. (In [25], a connection is made between the lower and upper bounds fl, fu and confidence intervals.) One can show that the partial order based condition (1.5) implies the norm based condition (1.4). Indeed, given Al, Au as in (1.5), one definesIt can be readily verified that the so defined A satisfies (1.4). The opposite implication is, in general, wrong. Hence, if an estimate (1.5) is available, it allows one to describe the operator error more precisely and one may expect better reconstructions. Indeed, it was found in [23] that solving (1.2) with and α chosen according to a generalised discrepancy principle [18] based on (1.4) produces overregularised solutions compared to (1.8), i.e. the generalised discrepancy principle tends to overestimate the regularisation parameter. One of the reasons for this is the use of the triangle inequality to account for (1.4), which makes the estimates not sharp, in general. The motivation for this paper is two-fold. First, we want to extend the approach (1.5) and (1.8) to a broader class of fidelity terms than the characteristic function of a ball and more general data spaces than L∞. We also aim at a unified analysis of problems with fidelities that do not satisfy a triangle-type inequality, which is interesting in its own right. Our proofs mostly rely on convex analysis and duality. Setup. We consider the inverse problem (1.1), where and are duals of Banach lattices and , respectively. 
We assume that the partial order on is induced by the partial order in as follows: (cf lemma A.4 in the appendix). Furthermore, we assume that (1.1) possesses a non-negative -minimising solution , i.e. We propose the following extension of (1.2) to the case when the forward operator is known only up to the order interval given in (1.5)where and (as a function of its first argument) are assumed proper, convex and weakly-* lower semicontinuous (cf assumption 1). Main contribution. In this work we study convergence of solutions of (1.11) to a -minimising solution of (1.1) as the noise in data and operators decreases, and obtain convergence rates in one-sided Bregman distances with respect to . We also give conditions when (1.11) admits strong duality, in which case the convergence rates translate to symmetric Bregman distances. Furthermore, we analyse an a posteriori parameter choice rule based on a discrepancy principle for (1.11). Our results apply inter alia to general φ-divergences, as for instance the Kullback–Leibler divergence, and coercive fidelities such as powers of norms or Wasserstein distances from optimal transport. In addition, we also obtain rates for sums and infimal convolutions of different fidelities, as used for instance in mixed-noise removal. Even for exact operators, our analysis goes beyond the state of the art in problems with fidelity terms that lack a triangle-type inequality. Structure of the paper. In section 2 we study existence of solutions of the problem (1.11) and its dual and establish sufficient conditions for strong duality. In section 3 we derive convergence rates for a priori parameter choice rules. In section 4 we formulate a discrepancy principle for the problem (1.11) and also obtain convergence rates. For readers’ convenience, we present some background material on Banach lattices in the appendix.

Examples of regular operators

Below, we give some examples of regular operators and discuss how lower and upper bounds in the sense of (1.5)–(1.6) can be obtained.

If $\mathcal Y$ is an abstract maximum space (a generalisation of $L^\infty$) or if $\mathcal X$ is an abstract Lebesgue space (a generalisation of $L^1$), then all linear bounded operators $A\colon \mathcal X \to \mathcal Y$ are regular, i.e. they can be written as a difference of two positive operators. More details can be found in the appendix.

(Integral operators: perturbations of the kernel). Let $A\colon L^p(\Omega) \to L^q(\Omega)$ ($\Omega$ bounded, p, q ⩾ 1) be an integral operator with a (p, q)-bounded kernel k [26],

$$Au(x) = \int_\Omega k(x, \xi)\, u(\xi)\, d\xi. \tag{1.12}$$

The operator A can be written as

$$A = A_+ - A_-, \qquad A_\pm u(x) = \int_\Omega k_\pm(x, \xi)\, u(\xi)\, d\xi,$$

where $k_+$ and $k_-$ are the positive and the negative parts of k (in the a.e. sense in Ω × Ω). Clearly, $A_\pm$ are positive and A is regular. Suppose that the kernel is corrupted by an unknown (p, q)-bounded perturbation such that we only know pointwise lower and upper bounds for k,

$$k^l(x, \xi) \le k(x, \xi) \le k^u(x, \xi) \quad \text{a.e. in } \Omega \times \Omega. \tag{1.13}$$

Then lower and upper operators in the sense of (1.5) are given by

$$A^l u(x) = \int_\Omega k^l(x, \xi)\, u(\xi)\, d\xi, \qquad A^u u(x) = \int_\Omega k^u(x, \xi)\, u(\xi)\, d\xi.$$

It should be noted that the bounds (1.13) are of a deterministic nature. They could arise, for example, if the kernel depends on additional parameters θ ∈ Θ, i.e. $k(x, \xi) = k_\theta(x, \xi)$. If reconstructing the unknown parameter θ is not of independent interest, the dependence on it can be eliminated by defining

$$k^l(x, \xi) = \inf_{\theta \in \Theta} k_\theta(x, \xi), \qquad k^u(x, \xi) = \sup_{\theta \in \Theta} k_\theta(x, \xi),$$

provided the suprema and infima are finite for a.e. x, ξ and $k^{l,u}$ are (p, q)-bounded.

(Integral operators: discretisation). Let the operator A be as defined in example 1.2 on an interval and consider its approximation by Riemann sums. In particular, let $A^l_n u$ and $A^u_n u$ denote the lower and upper Riemann sums in (1.12) obtained using an n-point discretisation. Then these sums define lower and upper operators in the sense of (1.5),

$$A^l_n u \le Au \le A^u_n u \quad \text{for all } u \ge 0.$$

As we refine the discretisation (i.e. n → ∞), these bounds converge pointwise to Au(x).

(Integration with respect to a vector-valued measure). Example 1.2 can be generalised as follows. Let $\mu$ be a vector-valued Radon measure [27] on Ω with values in Y, where Ω is a compact metric space and Y is a Banach lattice with the Radon–Nikodým property.
Define partial order on as followsLet be defined as followsSince Y is a lattice, it is clear that A is regular. Lower and upper bounds in the sense of (1.14) define lower and upper operators Al,u in the sense of (1.5). (1D source identification). We consider the operator , A: u ↦ φ, where φ solvesHere is a continuous function which meets a ⩾ a0 > 0 on [0, 1] and is a Radon measure with integrable antiderivative . Integrating the equation yieldsClearly, A ⩽ 0 and hence regular. Hence, if are continuous functions such that on [0, 1] and on [0, 1], we can define operatorswhich meet Alu ⩽ Au ⩽ Auu for u ⩾ 0 (and hence U ⩾ 0). If , then Al,u converge to A in the operator norm. If one defines the operator A on L1((0, 1)) instead of , the antiderivative U is continuous and one can approximate the integrals in Al and Au with lower and upper Riemann sums, respectively. This gives rise to operators and such that . If then additionally n → ∞, the operators converge to A. Note that a similar approach can be used for estimating the diffusivity a for a given source term. In this case, however, the forward operator A becomes non-linear. This would require an extension of our theory. (Conditional expectations). Let Ω be a separable metric space and (Ω, Σ, μ) be a probability space. Let B ⊂ Σ be a sub-σ-algebra of Σ and let be its minimal generator (which exists, since Ω is separable). The conditional expectation operator is defined as followsunder the convention 0/0 = 0. Clearly, A ⩾ 0 and hence regular. If we allow μ to be a finite signed measure, then we can generalise the definition as followswhere is the total variation of μ. Clearly,and A± ⩾ 0, hence A is regular. In contrast to example 1.4, partial order bounds on μ in the sense of (1.14) do not translate into lower and upper bounds (1.6) for A since A is not an integral operator (in particular, it is not linear in μ).
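The kernel-perturbation example above has a simple finite-dimensional analogue, sketched below under stated assumptions: the integral operator is replaced by a small matrix with illustrative entries, and the perturbation bound is a uniform entrywise tolerance `eps` (both hypothetical choices, not taken from the paper). The sketch checks the decomposition into positive and negative parts and the order bounds of (1.6) for a non-negative input.

```python
def matvec(K, u):
    """Apply a matrix (given as a list of rows) to a vector."""
    return [sum(k * x for k, x in zip(row, u)) for row in K]

def pos_neg_parts(K):
    """Entrywise positive and negative parts, so that K = K_plus - K_minus."""
    K_plus = [[max(k, 0.0) for k in row] for row in K]
    K_minus = [[max(-k, 0.0) for k in row] for row in K]
    return K_plus, K_minus

def interval_bounds(K, eps):
    """Entrywise lower/upper kernels with K - eps <= K <= K + eps (eps >= 0)."""
    K_low = [[k - eps for k in row] for row in K]
    K_up = [[k + eps for k in row] for row in K]
    return K_low, K_up

# Illustrative discretised kernel with mixed signs (values are made up).
K = [[0.5, -0.2, 0.0],
     [0.1, 0.3, -0.4]]
u = [1.0, 2.0, 0.5]  # a non-negative input, as required in (1.6)

K_plus, K_minus = pos_neg_parts(K)
K_low, K_up = interval_bounds(K, eps=0.05)

A_u = matvec(K, u)
# Regularity: A u equals the difference of two positive operators applied to u.
A_u_split = [p - m for p, m in zip(matvec(K_plus, u), matvec(K_minus, u))]
# Order interval: lower and upper operators applied to the same u >= 0.
lower = matvec(K_low, u)
upper = matvec(K_up, u)
```

For inputs with negative entries the componentwise inequalities between `lower`, `A_u` and `upper` can fail, which is why the partial order on operators is only tested on the positive cone.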

Primal and dual problems

In this section we establish existence of solutions to (1.11) using the direct method, where standard assumptions on the forward operators, the regularisation, and fidelity function will guarantee coercivity and lower semicontinuity. Subsequently, we derive the dual maximisation problem and prove existence and strong duality under the additional assumption that the data space is an abstract maximum space.

Existence of a primal solution

We make the following standard assumptions on the regularisation functional , the fidelity function , and the operators Al,u. The regularisation functional is Proper, convex and weakly-* lower semicontinuous; Its non-empty sublevel sets are weakly-* sequentially compact. The fidelity function is Proper, convex in its first argument and weakly-* lower semicontinuous jointly in both arguments; if and only if v = f. The operators are weak-* to weak-* continuous. A sufficient condition for assumption 2 to hold is given in lemma A.5 in the appendix. Suppose that assumptions 1 and 2 hold true. Then (1.11) has a solution. Consider a minimising sequence (u, v). Due to assumption 1 there exists a convergent subsequence u (that we don’t relabel) such thatThen assumption 2 yieldsFrom (1.11) we get that for all khenceandsince weakly-* convergent sequences are bounded. Since is a dual of a separable Banach space , by the sequential Banach–Alaoglu theorem the sequence v contains a weakly-* convergent subsequence v (that we do not relabel) such thatSince both Al,uu and v converge weakly-* and order intervals in are weakly-* closed due to lemma A.4, we obtain thatHence (u∞, v∞) is feasible for (1.11). Furthermore, since and are weakly-* lower semicontinuous, we get thatTherefore, (u∞, v∞) is a solution of (1.11). □

Dual problem

To simplify our notation, we introduce an operator and an operator With this notation we can rewrite (1.11) as follows The (Lagrangian) dual problem of (2.3) is given by The Lagrangian of (2.3) is given bywhere , μ ⩾ 0. Minimising the Lagrangian in u and v, we obtain Taking a supremum over μ ⩾ 0 gives (2.4). □ It is well known (e.g., [28]) thatwhich is referred to as weak duality. If the fidelity function depends only on the difference of its arguments, i.e. , thenand problem (2.4) becomesIf , we have and hence we obtain the standard form (e.g., [29])
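The duality computation described above can be sketched as follows. The displayed formulas are not recoverable from this text, so the sketch assumes that (2.3) reads $\min_{u,v} \tfrac{1}{\alpha}\mathcal H(v \mid f) + \mathcal J(u)$ subject to $Bu \le Ev$, with $Bu = (A^l u, -A^u u)$ and $Ev = (v, -v)$. These explicit forms are assumptions, chosen to be consistent with the constraint Bu ⩽ Ev and with the argument αE*μ† that appears in the later sections.

```latex
\begin{align*}
\mathcal L(u, v; \mu)
  &= \tfrac{1}{\alpha}\,\mathcal H(v \mid f) + \mathcal J(u)
     + \langle \mu,\, Bu - Ev \rangle, \qquad \mu \ge 0, \\
\inf_{u}\, \bigl\{ \mathcal J(u) + \langle B^*\mu, u \rangle \bigr\}
  &= -\,\mathcal J^*(-B^*\mu), \\
\inf_{v}\, \bigl\{ \tfrac{1}{\alpha}\,\mathcal H(v \mid f) - \langle E^*\mu, v \rangle \bigr\}
  &= -\,\tfrac{1}{\alpha}\,\mathcal H^*(\alpha E^*\mu \mid f), \\
\text{dual problem:} \quad
  \sup_{\mu \ge 0}\; & -\,\mathcal J^*(-B^*\mu)
     - \tfrac{1}{\alpha}\,\mathcal H^*(\alpha E^*\mu \mid f), \\
\text{if } \mathcal H(v \mid f) = h(v - f): \quad
  \mathcal H^*(q \mid f) &= h^*(q) + \langle q, f \rangle .
\end{align*}
```

The last line is the usual shift rule for convex conjugates; it is what turns the general dual into a form involving only $h^*$ and a linear term in f when the fidelity depends on the difference of its arguments.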

Existence of a dual solution and strong duality

The goal of this section is to study the relationship between the primal problem (2.3) and its dual (2.4), establishing strong duality and existence of a dual solution, and obtaining complementarity conditions for Lagrange multipliers associated with constraints in (2.3). We will need the following result from [28, theorem 2.165]. ([28]). Consider the following optimisation problemand its dualwhere X and Y are Banach spaces, L: X → Y is a linear bounded operator, L* its adjoint, K ⊂ Y a closed convex set, and a proper convex lower semicontinuous function with convex conjugate . The characteristic function of K is denoted by χ(⋅) and its convex conjugate (i.e. the support function of K) by . Suppose that the following regularity condition is satisfiedThen there is no duality gap between problems () and (). If the optimal value of () is finite, then the dual problem () has at least one solution . The regularity condition (2.6) is due to Robinson [30] and plays an important role in the stability of optimisation problems under perturbations of the feasible set [28]. To ensure that (2.6) is satisfied in the primal problem (2.3), we will need to assume that the positive cone in has a non-empty interior. This naturally leads to the concept of abstract maximum spaces [24] which are a generalisation of L∞(Ω). A Banach lattice with norm ‖⋅‖ is called an AM-space (abstract maximum space) ifAn element which meetsis called unit of . Here x ∨ y and denote the usual supremum and absolute value of elements in a Banach lattice (cf appendix). Let be an AM-space with unit and suppose that there exist and such thatwhere ɛ > 0 is a constant. Then Robinson’s condition (2.6) is satisfied in the primal problem. In the notation of theorem 2.4, we have , , L ≔ (B, −E) and (where denotes the negative cone in ). Take an arbitrary with ‖y‖ ⩽ ɛ. Without loss of generality we can choose the norm on to be ‖y‖ = max(‖y1‖, ‖y2‖). 
Hence, the definition of the unit impliesTo show Robinson regularity, we need to write y asfor some , and , z1,2 ⩽ 0. Writing this in terms of Al and Au, we getTake u = u0 and v = v0. Thenand we can take z1,2 as above to represent y as in (2.7). Hence, the Robinson condition (2.6) is satisfied. □ Since the optimal value of the primal problem (2.3) is finite, using theorem 2.4 we conclude that there exists a solution μ of the dual problem (2.4) and there is no duality gap, i.e.where (u, v) is a primal optimal solution. Moreover, from [28, theorem 3.6] we conclude that μ is a Lagrange multiplier for the constraint Bu ⩽ Ev in (2.3) and the following complementarity condition holds Let μ be an optimal solution of (2.4) and (u, v) be an optimal solution of (2.3). Then under the assumptions of theorem 2.6 we have the following relations Using the Fenchel–Young inequality, strong duality (2.8) and the feasibility of (u, v), we obtain Hence, equality holds everywhere and we get that and . □

Convergence analysis

Having investigated well-posedness of the primal and dual problems, we can now prove convergence rates of solutions as the noise in the data and the operator tends to zero. To this end we consider sequencesand corresponding sequences (u, v) and μ which solve problems (2.3) and (2.4), respectively. We are interested in studying the behaviour of (u, v) as n → ∞ and would like to prove that u converges to a -minimizing solution (cf (1.10)) whereas v approaches the exact data . If the fidelity function depends on the difference of the arguments, i.e. , then it does not matter if we choose or in (3.1c). For asymmetric fidelities such as the Kullback–Leibler divergence it does. If we think of the Kullback–Leibler divergence DKL(p|q) as the amount of information lost by using q instead of p (see [31]), then it actually makes sense to choose in (3.1c), i.e. to measure the amount of information lost by using the noisy measurement f instead of the exact one . We start with results that do not require the existence of a dual solution and are valid under general assumptions (cf theorem 2.1).

Convergence of primal solutions

We consider a sequence of primal problems (2.3)where is defined as follows Under assumptions 1 and 2, we obtain the following standard result. Suppose that the regularisation functional and the fidelity function satisfy assumption 1 and the operators satisfy assumption 2. Suppose also that the regularisation parameter α is chosen such thatThen any solution u of the primal problem (2.3) converges weakly-* to a -minimising solution of (1.1)and v converges weakly-* to the exact data in (1.1) Comparing the value of the objective function at the optimum (u, v) and (which is a feasible point for all n), we getandSince , the value on the right-hand side is bounded uniformly in n. Hence, since sublevel sets of are weakly-* sequentially compact, u contains a weakly-* convergent subsequence (that we do not relabel) that converges to some Since A is weak-* to weak-* continuous by assumption and , we get thatSince (u, v) is feasible in (2.3) for all n, it holdsUsing weak-* closedness of order intervals (cf lemma A.4), we inferFrom (3.2) we get thatSince is lower semicontinuous jointly in both arguments, we obtainand henceTherefore, by (3.4) we haveSince is lower semicontinuous, (3.3) implies thathence u∞ is a -minimising solution. □

Convergence rates

In modern variational regularisation, (generalised) Bregman distances are typically used to study convergence of approximate solutions [32]. For a proper convex functional the generalised Bregman distance between corresponding to the subgradient is defined as followswhere denotes the subdifferential of at . The symmetric Bregman distance between u and w corresponding to and is defined as follows Bregman distances do not define a metric since they do not satisfy the triangle inequality and does not imply u = w. To obtain convergence rates, we will need to make an additional assumption on the regularity of the -minimising solution called the source condition. There are several variants of the source condition (e.g., [6, 33, 34]); we will use the variant from [6], which in our notation can be written as follows (Source condition). There exists , μ† ⩾ 0, such that The source condition (3.5) is equivalent to the standard oneIndeed, since and with , we get thatwhich implies (3.6) with . For the converse implication we note that since is a lattice, we can write an arbitrary as followswhere . Hence, (3.6) implies (3.5) with μ† ≔ (ω−, ω+).
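The Bregman distances introduced at the beginning of this subsection can be illustrated with a minimal finite-dimensional sketch. It uses the standard definitions $D_{\mathcal J}^{p}(u, w) = \mathcal J(u) - \mathcal J(w) - \langle p, u - w\rangle$ with $p \in \partial\mathcal J(w)$, and the symmetric distance $\langle q - p, u - w\rangle$ with $q \in \partial\mathcal J(u)$. The two functionals and the test vectors are illustrative assumptions, not taken from the paper.

```python
def bregman(J, p, u, w):
    """Generalised Bregman distance J(u) - J(w) - <p, u - w>, with p a subgradient of J at w."""
    return J(u) - J(w) - sum(pi * (ui - wi) for pi, ui, wi in zip(p, u, w))

def symmetric_bregman(q, p, u, w):
    """Symmetric Bregman distance <q - p, u - w>, with q in dJ(u) and p in dJ(w)."""
    return sum((qi - pi) * (ui - wi) for qi, pi, ui, wi in zip(q, p, u, w))

def half_sq_norm(u):
    return 0.5 * sum(x * x for x in u)

def l1_norm(u):
    return sum(abs(x) for x in u)

def sign(x):
    return float((x > 0) - (x < 0))

u = [2.0, 0.5, 0.0]
w = [1.0, 3.0, 0.0]

# J = 1/2 ||.||^2: the only subgradient at w is w itself, and the Bregman
# distance reduces to 1/2 ||u - w||^2.
d_quad = bregman(half_sq_norm, w, u, w)
d_symm_quad = symmetric_bregman(u, w, u, w)

# J = ||.||_1: sign(w) is a subgradient at w (choosing 0 where w_i = 0).
# u and w share a sign pattern, so the distance vanishes although u != w.
p_l1 = [sign(x) for x in w]
d_l1 = bregman(l1_norm, p_l1, u, w)
```

The one-homogeneous case shows why convergence in a Bregman distance is weaker than norm convergence: it only measures how far u is from the face of the unit ball exposed by the subgradient.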

Convergence rates in a one-sided Bregman distance

We start with a convergence rate in a one-sided Bregman distance , where p† ≔ −B*μ† is the subgradient from the source condition (3.5). Let assumptions of theorem 2.1 and assumption 3 be satisfied and (3.1) hold. Then the following estimate holdswhere p† = −B*μ† is the subgradient from assumption 3. We start with the following estimatewhere η is as defined in (3.1b) and we used the fact that Bu ⩽ Ev. Since (u, v) is primal optimal and is feasible, we get thatand therefore By the Fenchel–Young inequality, the term in the brackets is bounded by , hence □

Convergence rates in a symmetric Bregman distance

Under a stronger assumption that is an AM-space (cf theorem 2.6), we can obtain an estimate in a symmetric Bregman distance. Let assumptions of theorem 2.6 and assumption 3 be satisfied and (3.1) hold. Then the following estimate holdswhere the symmetric Bregman distance corresponds to the subgradients from assumption 3 and . The symmetric Bregman distance between u and is given by Since the pair is feasible for all n, we get that . It is also evident that . Combining this with the complementarity condition (2.9), we obtain Since the pair (u, v) is also feasible, we get that Bu ⩽ Ev and hencewhere ‖u‖ is bounded due to theorem 3.2. From the Fenchel–Young inequality and theorem 2.8 we get thathencewhich yields the desired estimate upon dividing by α. □

Applications to different fidelity terms

To apply theorems 3.5 or 3.6, we need to study the term separately for each fidelity term.

φ-divergences

Let $\varphi$ be a convex function on $(0, \infty)$. For two probability measures ρ, ν on Ω with ρ ≪ ν the φ-divergence (often called f-divergence) is defined as follows

$$D_\varphi(\rho \mid \nu) = \int_\Omega \varphi\Bigl(\frac{d\rho}{d\nu}\Bigr)\, d\nu,$$

where φ(1) = 0. We refer to [35] for many examples and fundamental properties of φ-divergences. Since ρ and ν have unit mass, the function φ is only determined up to the additive term c(x − 1) for $c \in \mathbb R$. In particular, since φ is convex and meets φ(1) = 0, it is straightforward to see that one can always find a suitable $c \in \mathbb R$ such that φ(x) + c(x − 1) ⩾ 0 for all x > 0. Hence, we will without loss of generality assume that φ ⩾ 0.

We take $\mathcal Y$ to be the space of Radon measures on Ω equipped with the total variation norm and consider the fidelity (3.12) given by the φ-divergence $D_\varphi(v \mid f)$ on the set of probability measures. We estimate the convex conjugate of $\mathcal H(\cdot \mid \nu)$ as follows:

$$\mathcal H^*(h \mid \nu) \le \int_\Omega \varphi^*(h)\, d\nu. \tag{3.13}$$

Since φ(1) = 0 and φ ⩾ 0, we know that φ*(0) = 0 and φ*(x) ⩾ x. Indeed, we have $\varphi^*(0) = \sup_x (-\varphi(x)) = -\inf_x \varphi(x) = 0$ and, by the Fenchel–Young inequality, φ*(x) ⩾ x − φ(1) = x. This motivates us to assume

$$\varphi^*(x) = x + r(x), \tag{3.14}$$

where r(x)/x → 0 as x → 0. This is satisfied in many cases (examples will be provided later on).

Let $\mathcal H$ be as defined in (3.12) and let the assumptions of theorem 3.5 be satisfied. Suppose that (3.14) holds, together with an additional condition on the source element μ† from assumption 3. Then a convergence rate holds for the one-sided Bregman distance with respect to the subgradient p† = −B*μ† from assumption 3. Under an additional boundedness assumption on $A$, $A^l$ and $A^u$, we get the same rate for the symmetric Bregman distance (cf theorem 3.6).

Taking h = αE*μ† and ν = f in (3.13), and using (3.14), gives a bound on the convex conjugate, and in combination with (3.7) this yields the assertion. □

KL-divergence. Here $\varphi(x) = x \log(x) - (x - 1)$ and $\varphi^*(x) = e^x - 1 = x + r(x)$ with $r(x) = x^2/2 + x^3/6 + \dots$. The resulting rate coincides with [4] in the case of an exact operator, and a suitable choice of α gives the optimal rate (3.17).

χ²-divergence. Here $\varphi(x) = (x - 1)^2$ and $\varphi^*(x) = x + x^2/4$ for $x \ge -2$. Again, the optimal rate coincides with (3.17).

Squared Hellinger distance. Here, too, the optimal rate coincides with (3.17).

Total variation.
For the total variation (of measures), a convergence rate is obtained for any sufficiently small α = const.

(Poisson noise). The main motivation for the use of the Kullback–Leibler divergence as a fidelity term is the modelling of Poisson noise [1]. If t denotes the exposure time, the measured data can be assumed to be generated by a Poisson process whose intensity depends on t. In this case, an upper bound on the error in the Kullback–Leibler divergence is given in [36]. While in the deterministic setting this estimate is sufficient to obtain convergence rates, the statistical setting requires further assumptions, in particular some concentration inequalities [2, 36, 37].

Suppose that the fidelity function is coercive in the sense of (3.21), that is, $\mathcal H(v \mid f)$ is bounded from below by the power $\|v - f\|^\lambda$ of the norm up to a constant factor, where λ ⩾ 1 and C > 0 are constants (we will assume without loss of generality that C = 1). Then under the assumptions of theorem 3.5 convergence rates hold for the one-sided Bregman distance with respect to the subgradient p† = −B*μ† from assumption 3. With an appropriate choice of α (different for λ > 1 and for λ = 1), we get the optimal rate. If $\mathcal Y$ is an AM-space (cf theorem 3.6), the same rate holds for the symmetric Bregman distance.

Since convex conjugation is order-reversing, (3.21) yields an upper bound on the convex conjugate of the fidelity (we will drop the subscripts after the norms to simplify notation). We will consider the cases λ > 1 and λ = 1 separately. Let λ > 1. Then the estimate follows from theorem 3.5, condition (3.21) and the Cauchy–Schwarz inequality. Let now λ = 1. Then the estimate follows from theorem 3.5 for a sufficiently small but fixed α. □

The value of α obtained for λ = 1 matches the exact penalisation parameter in regularisation with one-homogeneous fidelity terms (e.g. [4, 6, 38]). Exact penalisation means that the regularisation parameter α does not have to be sent to zero in order to obtain convergence in the Bregman distance. It is observed if the subdifferential is not a singleton.

(Powers of norms).
Theorem 3.9 obviously applies if the fidelity function is given by a power of the norm, i.e.This covers important cases such as the squared L2 norm fidelity which is used to model Gaussian noise and the L1 norm fidelity which is often used to model salt-and-pepper noise [39]. (Wasserstein distances). For any p ⩾ 1, the p-Wasserstein distance between two probability measures is defined as follows (cf [40])where Π(ρ, ν) is the space of probability measures on Ω × Ω with marginals ρ and ν. Let the data space be the closure of the space of Radon measures with respect to the Kantorovich–Rubinstein normwhere Lip denotes the Lipschitz constant [41]. Obviously it holds for all and if μ ⩾ 0 by choosing g ≡ 1 (it is known that the positive cone , and hence also the set of probability measures , is closed in the KR norm [41, theorem 8.9.4]). For any and a probability measure we letIt is well known that for any two probability measures It is also known that for any q ⩽ p and any two probability measures , the following relation holds [40]Hence, the data term defined in (3.22) satisfiesi.e. it is strongly coercive on KR(Ω). Note that it is not strongly coercive on equipped with the total variation norm. Hence, using theorem 3.9 we get the following optimal rate
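The φ-divergence examples discussed in this section can be checked numerically for discrete measures. The following is a minimal sketch, assuming finitely supported probability vectors `rho` and `nu` (illustrative values) and the standard definition $D_\varphi(\rho \mid \nu) = \sum_i \varphi(\rho_i/\nu_i)\,\nu_i$. The helper names are hypothetical and introduced only for this illustration.

```python
import math

def phi_divergence(phi, rho, nu):
    """Discrete phi-divergence sum_i phi(rho_i / nu_i) * nu_i, assuming rho << nu."""
    return sum(phi(r / n) * n for r, n in zip(rho, nu) if n > 0)

def phi_kl(x):
    """phi(x) = x log x - (x - 1), extended by its limit value 1 at x = 0."""
    return x * math.log(x) - (x - 1) if x > 0 else 1.0

def phi_chi2(x):
    """phi(x) = (x - 1)^2."""
    return (x - 1) ** 2

def kl_conjugate_at(y):
    """Evaluate sup_x (x*y - phi_kl(x)) at its stationary point x = exp(y)."""
    x = math.exp(y)
    return x * y - phi_kl(x)

rho = [0.2, 0.5, 0.3]
nu = [0.25, 0.25, 0.5]

# For probability vectors the phi-form agrees with the usual KL formula,
# because the linear term -(x - 1) integrates to zero.
kl = phi_divergence(phi_kl, rho, nu)
kl_direct = sum(r * math.log(r / n) for r, n in zip(rho, nu))

chi2 = phi_divergence(phi_chi2, rho, nu)
```

Evaluating `kl_conjugate_at(y)` reproduces $e^y - 1$, in line with the conjugate stated for the KL-divergence, so that $r(y) = e^y - 1 - y$ is of second order near zero.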

Characteristic function of a norm ball

Let the fidelity function be as followsThis type of fidelity functions corresponds to the so-called residual method [15, 42] and allows one to explicitly use the noise level δ in the reconstruction (another way of doing so is the discrepancy principle, see section 4). It is clear thatWith this particular fidelity function the parameter α does not have any effect on the solutions of (2.3), hence with no loss of generality we will assume α = const for all n. The coercivity assumption (3.21) is not satisfied for this fidelity function (it is only weakly coercive, i.e. ‖v − f‖ → ∞ implies ) and theorem 3.9 does not apply. Let the fidelity function be as defined in (3.23). Then under the assumptions of theorem 3.5 the following convergence rate holdswhere p† = −B*μ† is the subgradient from assumption 3. If is an AM-space (cf theorem 3.6), the same rate holds for the symmetric Bregman distance . Taking the convex conjugate of defined in (3.23), we getHence,since . Plugging this into the estimate in theorem 3.5 (resp. theorem 3.6) and remembering that α = const for all n, we get the assertion. □
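The conjugate computation used in the proof above can be written out as follows. This is a sketch under the assumption that (3.23) is the characteristic function of the ball of radius δ around f; $\|\cdot\|_*$ is a notational assumption for the norm that is dual, in the relevant pairing, to the norm defining the ball.

```latex
\begin{align*}
\mathcal H(v \mid f) &= \chi_{\{\|\cdot\| \le \delta\}}(v - f), \\
\mathcal H^*(q \mid f)
  &= \sup_{\|v - f\| \le \delta} \langle q, v \rangle
   = \langle q, f \rangle + \delta\, \|q\|_*, \\
\tfrac{1}{\alpha}\, \mathcal H^*(\alpha q \mid f)
  &= \langle q, f \rangle + \delta\, \|q\|_* \qquad (\alpha > 0).
\end{align*}
```

The last line does not depend on α, which is consistent with the observation that α has no effect on the solutions for this fidelity.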

Sum of fidelities

Having studied a plethora of explicit examples of fidelity functions, we now turn to combinations of several fidelities, each of which can be studied as above. Let us assume that $\mathcal H$ is the sum of two other fidelity functions $\mathcal H_1$ and $\mathcal H_2$, i.e.,

$$\mathcal H(v \mid f) = \mathcal H_1(v \mid f) + \mathcal H_2(v \mid f).$$

Such fidelities were studied e.g. in [43] and allow one to simultaneously handle data from different modalities. Furthermore, in [44-46] fidelities of $L^1 + L^2$-type were analysed and used for image restoration in the presence of mixed Gaussian and impulse noise. If $\mathcal H_1$ and $\mathcal H_2$ are proper, it holds

$$\mathcal H^*(q \mid f) \le \inf_{z}\, \bigl\{ \mathcal H_1^*(q - z \mid f) + \mathcal H_2^*(z \mid f) \bigr\}, \tag{3.26}$$

where the term on the right-hand side is the so-called infimal convolution of $\mathcal H_1^*$ and $\mathcal H_2^*$. Let us assume that we have estimates of the form (3.27) for each of the fidelities. The functions $R_i$ are assumed to be non-decreasing in both arguments and we set $R_i(\alpha, \cdot) = \infty$ for α < 0. Combining (3.26) and (3.27) and using the monotonicity properties of $R_i$ shows that the convergence rate for $\mathcal H$ can be estimated by the infimal convolution of the rates associated to $\mathcal H_1$ and $\mathcal H_2$. If $\mathcal Y$ is an AM-space (cf theorem 3.6), the same rate holds for the symmetric Bregman distance.

Infimal convolution of fidelities

Let us consider the case that $\mathcal H$ is given by the infimal convolution of two other fidelities $\mathcal H_1$ and $\mathcal H_2$. Such fidelities are also chosen for the removal of mixed noise in image restoration (see e.g. [47] for an application to hyperspectral unmixing and [48] and the references therein for image denoising with mixtures of Gaussian, impulse, and Poisson noise). Since the infimal convolution optimally decomposes v into a noise part w, which is small in $\mathcal H_1$, and a residual v − w, which is close to the data f in $\mathcal H_2$, such fidelities are more suitable for this purpose than the plain sum of fidelities studied in the previous section.

By standard calculus for infimal convolutions, if $\mathcal H_1$ and $\mathcal H_2$ are proper, the convex conjugate of $\mathcal H$ is the sum of the convex conjugates of $\mathcal H_1$ and $\mathcal H_2$ (3.30). Furthermore, under the hypothesis that one of the two fidelities is coercive, the other is bounded from below, and both are weakly-* lower semicontinuous convex functions, it holds that $\mathcal H$ is weakly-* lower semicontinuous, proper, and exact (see [49] for the statement and [50] for a proof on Hilbert spaces which generalises to Banach spaces). The latter means that the infimum in the definition of $\mathcal H$ is attained.

Furthermore, from (3.30) we get an estimate whose two terms depend only on the individual fidelities $\mathcal H_1$ and $\mathcal H_2$. In all the examples studied above, such estimates are available. Using the functions $R_i$ defined in (3.27) above together with (3.31), we get the statement that the rate of convergence of an infimal convolution of fidelities can be estimated by the sum of the individual rates associated to $\mathcal H_1$ and $\mathcal H_2$. This is in contrast to the rate of a sum of fidelities being given by the infimal convolution of the rates, as shown in the previous section. If $\mathcal Y$ is an AM-space (cf theorem 3.6), the same rate holds for the symmetric Bregman distance.
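A scalar instance makes the decomposition into a noise part w and a residual v − w concrete. The sketch below is an illustration only, assuming the two terms are $|w|$ and $\tfrac12 (x - w)^2$ (a one-dimensional stand-in for the $L^1$ and squared $L^2$ fidelities). In that case the infimum is attained at the soft-thresholded value of x and the infimal convolution is the Huber function. The brute-force grid search is included only as an independent check.

```python
def inf_conv_abs_quad(x):
    """Return (value, minimiser) of inf_w |w| + 0.5 * (x - w)**2.

    The minimiser is the soft-thresholding of x with threshold 1.
    """
    w = max(abs(x) - 1.0, 0.0) * (1.0 if x >= 0 else -1.0)
    return abs(w) + 0.5 * (x - w) ** 2, w

def huber(x):
    """Huber function with threshold 1."""
    return 0.5 * x * x if abs(x) <= 1.0 else abs(x) - 0.5

def brute_force(x, half_width=5000, step=0.001):
    """Approximate the same infimum by searching w on a uniform grid in [-5, 5]."""
    return min(abs(i * step) + 0.5 * (x - i * step) ** 2
               for i in range(-half_width, half_width + 1))

samples = [-3.0, -0.4, 0.0, 0.7, 2.5]
results = [(x, inf_conv_abs_quad(x)) for x in samples]
```

For small |x| the whole of x is attributed to the quadratic term (w = 0), while for large |x| the excess over the threshold is absorbed by the absolute value term. This is the mechanism that lets such fidelities separate impulse-type outliers from Gaussian-type noise.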

Discrepancy principle

When the operator is known exactly, Morozov’s discrepancy principle [10, 33] can be used to select the regularisation parameter α. In the case of a squared norm fidelity this amounts to selecting α such that where is the regularised solution corresponding to the regularisation parameter α and τ > 1 is a parameter. Here we assume that (and not ) to be consistent with our earlier notation. Convergence rates for this choice of α in the case of an exact operator and an arbitrary convex regularisation functional were obtained in [11]. For the data fidelity given by the Kullback–Leibler divergence, the discrepancy principle is studied in [13]. In the case of an imperfect operator, the discrepancy principle needs to be modified. When the operator error is measured using the operator norm, i.e. one assumes that an approximate operator A is available such that one can choose α as follows [15] (in the case of a squared norm fidelity in the Hilbert space setting) If the fidelity term is not based on a norm and does not satisfy the triangle inequality, such a generalisation is not available. Since in our case the operator error is explicitly accounted for through the constraints in (2.3), we can use the discrepancy principle in its original form (4.1) with an arbitrary fidelity term. We will choose α such that where solves (2.3) with the regularisation parameter α and τ > 1 is a parameter. If the solution is unique, then we have In case of non-uniqueness, we can always choose a solution such that (4.4) is satisfied, following the argument in [12, proposition 3.5–remark 3.8] and using convexity of the objective function in (2.3).
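For orientation, the classical exact-operator case can be sketched numerically. The code below is a minimal illustration under stated assumptions and is not the constrained problem (2.3): it assumes a squared-norm fidelity ‖Au − f‖², a quadratic regulariser ‖u‖² (so that the regularised solution has a closed form), a discretised integration operator as a toy forward model, and δ bounding the squared norm of the noise. The helper names `tikhonov`, `fidelity` and `discrepancy_alpha` are hypothetical. Since the fidelity value is non-decreasing in α, a bisection on a logarithmic scale finds a parameter whose fidelity value matches τδ.

```python
import numpy as np

def tikhonov(A, f, alpha):
    """Minimiser of ||A u - f||^2 + alpha * ||u||^2 (illustrative regulariser)."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ f)

def fidelity(A, f, alpha):
    """Squared-norm fidelity evaluated at the regularised solution."""
    u = tikhonov(A, f, alpha)
    return float(np.sum((A @ u - f) ** 2))

def discrepancy_alpha(A, f, delta, tau=1.5, lo=1e-12, hi=1e6, iters=200):
    """Bisection in log(alpha) for fidelity(alpha) = tau * delta."""
    target = tau * delta
    if fidelity(A, f, lo) > target or fidelity(A, f, hi) < target:
        raise ValueError("tau * delta is not bracketed on [lo, hi]")
    for _ in range(iters):
        mid = np.sqrt(lo * hi)           # geometric midpoint
        if fidelity(A, f, mid) > target:
            hi = mid                     # fidelity too large: decrease alpha
        else:
            lo = mid                     # fidelity within tolerance: increase alpha
    return lo                            # largest tested alpha with fidelity <= target

# Toy problem (all choices below are assumptions made for the illustration).
rng = np.random.default_rng(0)
n = 50
A = np.tril(np.ones((n, n))) / n         # discretised integration operator
t = (np.arange(n) + 0.5) / n
u_true = np.sin(2 * np.pi * t)
noise = 0.01 * rng.standard_normal(n)
f = A @ u_true + noise
delta = float(np.sum(noise ** 2))        # squared-norm noise level
tau = 1.5
alpha_star = discrepancy_alpha(A, f, delta, tau)
```

The bracketing check mirrors the well-posedness question: a suitable α exists only if τδ lies between the limiting fidelity values for small and large α.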

Existence

In this section we study well-posedness of the discrepancy principle, meaning that there is a regularisation parameter α which meets (4.3). Let (u, v) be a solution of (2.3) corresponding to the parameter α > 0. Define the following functions:

The function j(α) is monotone non-increasing and h(α) is monotone non-decreasing in α.

The proof is similar to [51]. □

If either or is strictly convex, then h(α) and j(α) are indeed uniquely defined (the argument is similar to [38]). Otherwise the lemma applies to and for any solution (u, v) of (2.3). Since j and h are monotone functions, they are in particular continuous for almost all values of α > 0.

The functions h and j defined in (4.5) are lower semicontinuous.

We just sketch the proof. Letting a sequence of parameters converge to α, one can easily see that the corresponding solutions converge (up to a subsequence) weakly-* to a pair (v, u) which solves the problem for α. Hence, by the lower semicontinuity of and the assertion follows. □

Suppose that for all n for some constant C > 1, which does not depend on n. Then the discrepancy principle (4.3) is well-posed for all τ ∈ (1, C), i.e. there exist α_n > 0 and a solution of (2.3) corresponding to α = α_n and f = f_n such that (4.3) is satisfied.

For every α > 0, because of the feasibility of we get and in particular for almost all α > 0. Letting α ↓ 0 we obtain, using the monotonicity of h, that On the other hand, by assumption it holds Hence, in light of (4.6) and (4.7), and the monotonicity of h, there exists α > 0 such that and τ can be chosen in (1, C). Since h is lower semicontinuous according to lemma 4.5, we get that which proves the assertion. □

The assumption of theorem 4.6 is rather weak. For instance, if , one can show that v ⇀* 0 as α → ∞. Hence, one can relax the assumption to which, for δ sufficiently small, is fulfilled in many applications.

Our goal in this section is to obtain convergence rates similar to those in theorem 3.5 (respectively theorem 3.6) for the parameter choice rule (4.3).
Let α be chosen according to (4.3). Then the following inequality holds for all n If the conditions of theorem 2.6 are satisfied, then also the following inequality holds

Comparing the value of the objective function in (2.3) at the optimal solution and and using (4.3), we get that Since τ > 1, this yields the first inequality. For the second one we use the Fenchel–Young inequality. Subtracting (3.10b) from (3.10a) we obtain which completes the proof. □

Under the assumptions of theorem 3.2 and with α chosen according to (4.3), converges weakly-* to a -minimising solution of (1.1), i.e.

Since is bounded uniformly in n and , we immediately get the desired result following the proof of theorem 3.2. □

Let α be chosen according to (4.3). Then, under the assumptions of theorem 3.5, the following estimate holds for the one-sided Bregman distance between and where p† = −B*μ† is the subgradient from assumption 3. Under the assumptions of theorem 3.6 the same estimate holds for the symmetric Bregman distance.

We start with the estimate (3.8). Using lemma 4.8, we obtain which yields the first assertion. For the second assertion, we use (3.10) and lemma 4.8 and obtain □

Strongly coercive fidelities. For strongly coercive fidelity terms such that (3.21) holds, we immediately get, using the Cauchy–Schwarz inequality, that and therefore we get the following rate which coincides with the optimal rate in theorem 3.9.

φ-divergences. For any φ-divergence that satisfies Pinsker’s inequality [52] with exponent λ where , we have the same situation as above. In particular, for the Kullback–Leibler divergence, the χ²-divergence and the squared Hellinger distance λ = 2 and which coincides with the optimal rate (3.17).

We summarise all convergence rates for obtained in this paper in table 1.

Table 1.

Summary of convergence rates for different fidelities in terms of the data error δ, the operator error η and the regularisation parameter α. Whenever α is absent in the a priori rate, exact penalisation occurs and the rate is independent of α as long as it is smaller than a fixed constant. Optimal rates correspond to an optimal choice of α in the a priori rate.

Fidelity	A priori rate	Optimal rate	Discr. principle
KL- and χ²-divergences, sq. Hellinger distance	$O\left(\frac{\delta}{\alpha}+\alpha+\eta\right)$	$O\left(\sqrt{\delta}+\eta\right)$	$O\left(\sqrt{\delta}+\eta\right)$
Total variation	$O(\delta+\eta)$	$O(\delta+\eta)$	$O(\delta+\eta)$
Wasserstein-p distance	$O\left(\frac{\delta}{\alpha}+\alpha^{\frac{1}{p-1}}+\delta^{\frac{1}{p}}+\eta\right)$, p > 1; $O(\delta+\eta)$, p = 1	$O\left(\delta^{\frac{1}{p}}+\eta\right)$	$O\left(\delta^{\frac{1}{p}}+\eta\right)$
Characteristic function of a norm ball	$O(\delta+\eta)$	$O(\delta+\eta)$	$O(\delta+\eta)$
λ-strongly coercive fidelities	$O\left(\frac{\delta}{\alpha}+\alpha^{\frac{1}{\lambda-1}}+\delta^{\frac{1}{\lambda}}+\eta\right)$, λ > 1; $O(\delta+\eta)$, λ = 1	$O\left(\delta^{\frac{1}{\lambda}}+\eta\right)$	$O\left(\delta^{\frac{1}{\lambda}}+\eta\right)$
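
The step from the a priori column to the optimal column of table 1 can be checked by balancing the two α-dependent terms. The short calculation below is a sketch of this standard balancing argument for the λ-strongly coercive row with λ > 1; it recovers the order of the rate and makes no claim about constants.

```latex
% Balance the terms that decrease and increase in \alpha:
\frac{\delta}{\alpha} = \alpha^{\frac{1}{\lambda-1}}
\;\Longleftrightarrow\;
\alpha^{\frac{\lambda}{\lambda-1}} = \delta
\;\Longleftrightarrow\;
\alpha = \delta^{\frac{\lambda-1}{\lambda}},
\qquad\text{so that}\qquad
\frac{\delta}{\alpha} = \delta^{\,1-\frac{\lambda-1}{\lambda}} = \delta^{\frac{1}{\lambda}} .
% Hence
O\!\left(\frac{\delta}{\alpha}+\alpha^{\frac{1}{\lambda-1}}+\delta^{\frac{1}{\lambda}}+\eta\right)
= O\!\left(\delta^{\frac{1}{\lambda}}+\eta\right)
\quad\text{for}\quad \alpha \sim \delta^{\frac{\lambda-1}{\lambda}} .
```

For λ = 2 this gives α ∼ √δ and the rate O(√δ + η), which matches the row for the KL- and χ²-divergences and the squared Hellinger distance. The same computation with p in place of λ applies to the Wasserstein-p row.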

Conclusions

In this work we have proven convergence rates in Bregman distances for variational regularisation in Banach lattices for problems with imperfect forward operators and general fidelity functions. Our results apply to many classes of fidelity functions and recover known convergence rates for norm-type fidelities and the Kullback–Leibler divergence in the case of exact operators. In addition, we have derived convergence rates for sums and infimal convolutions of fidelity functions, as used for mixed-noise removal. Furthermore, we have analysed an extension of Morozov’s discrepancy principle to problems with operator errors in the Banach lattice setting, which does not rely on the triangle inequality and hence applies to a broader class of fidelity functions.
