
Generalizations of Talagrand Inequality for Sinkhorn Distance Using Entropy Power Inequality.

Shuchan Wang1, Photios A Stavrou1, Mikael Skoglund2.   

Abstract

The distance that compares the difference between two probability distributions plays a fundamental role in statistics and machine learning. Optimal transport (OT) theory provides a theoretical framework to study such distances. Recent advances in OT theory include a generalization of classical OT with an extra entropic constraint or regularization, called entropic OT. Despite its convenience in computation, entropic OT still lacks sufficient theoretical support. In this paper, we show that the quadratic cost in entropic OT can be upper-bounded using entropy power inequality (EPI)-type bounds. First, we prove an HWI-type inequality by making use of the infinitesimal displacement convexity of the OT map. Second, we derive two Talagrand-type inequalities using the saturation of EPI that corresponds to a numerical term in our expressions. These two new inequalities are shown to generalize two previous results obtained by Bolley et al. and Bai et al. Using the new Talagrand-type inequalities, we also show that the geometry observed by Sinkhorn distance is smoothed in the sense of measure concentration. Finally, we corroborate our results with various simulation studies.


Keywords:  Schrödinger problem; Talagrand inequality; entropic optimal transport; entropy power inequality; log-concave measures

Year:  2022        PMID: 35205600      PMCID: PMC8871052          DOI: 10.3390/e24020306

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

OT theory studies how to transport one measure to another along the path with minimal cost. The Wasserstein distance is the cost given by the optimal path and is closely connected with information measures; see, e.g., [1,2,3,4,5]. During the last decade, OT has been studied and applied extensively, especially in the machine learning community; see, e.g., [6,7,8,9]. Entropic OT, a technique to approximate the solution of the original OT problem, was introduced for computational efficiency in [10]. A key concept in entropic OT is the Sinkhorn distance, which is a generalization of the Wasserstein distance with an extra entropic constraint. Due to the extra entropic constraint in the domain of the optimization problem, randomness is added to the original deterministic system, and the total cost increases from the original Wasserstein distance to a larger value. Therefore, a natural question is how to quantify the extra cost caused by the entropic constraint. In this paper, we derive upper bounds for the quadratic cost of entropic OT, which are shown to include an entropy-power term responsible for quantifying the amount of uncertainty caused by the entropic constraint. This work is an extended version of [11].

1.1. Literature Review

The dynamical formulation of OT, also known as the Benamou–Brenier formula [12], generalizes the original Monge–Kantorovich formulation into a time-dependent problem. It changes the original distance problem (i.e., find the distance between two prescribed measures) into a geodesic problem (i.e., find the optimal path between two prescribed measures). Using the displacement convexity of relative entropy along the geodesic, functional inequalities such as the HWI inequality and the Talagrand inequality can be obtained (see, e.g., ([13] Chapter 20)). The Talagrand inequality, first given in [1], upper bounds the Wasserstein distance by the relative entropy. Recent results in [2,4] obtain several refined Talagrand inequalities with dimensional improvements on the multidimensional Euclidean space. These inequalities bound the Wasserstein distance with the entropy power, which is sharper compared to the original bound with the relative entropy. An analogue of the dynamical OT problem is the Schrödinger problem (SP) [14]. The SP aims to find the most likely evolution of a system of particles with respect to a reference process. The most likely evolution is called a Schrödinger bridge. SP and OT intersect on many occasions; see, e.g., [15,16,17]. The problem we study in this paper lies in this intersection and is mostly related to [15]. In particular, Léonard in [15] showed that entropic OT with quadratic cost is equivalent to the SP with a Brownian motion as the reference process. He further derived that the Schrödinger bridge also admits a Benamou–Brenier formula with an additional diffusion term. Conforti in [18,19] showed that the process can also be formulated as a continuity equation and proved that the acceleration of the particles is the gradient of the Fisher information. The result therein leads to a generalized Talagrand inequality for relative entropy. Later, Bai et al. in [20] upper-bounded the extra cost from the Brownian motion by separating one Gaussian marginal into two independent random vectors. Using this approach, they showed that the dimensional improvement can be generalized to entropic OT and gave a Gaussian Talagrand inequality for the Sinkhorn distance. Additional results in [20] include a strong data processing inequality derived from their new Talagrand inequality and a bound on the capacity of the relay channel. Entropic OT has other interesting properties. For example, Rigollet and Weed studied the case with one empirical marginal in [21]. Their result shows that entropic OT performs maximum-likelihood estimation for Gaussian deconvolution of the empirical measure. This result can be further applied in uncoupled isotonic regression (see [9]). Dimensional dependence is also observed in applications of entropic OT. For example, the sample complexity bounds in [22,23] are dimension-dependent. In the GAN model, Reshetova et al. in [24] showed that the entropic regularization of OT promotes sparsity in the generated distribution. Another element in our paper is the EPI (for details on the EPI, see, e.g., [25,26,27]). This inequality provides an explicit bound on the differential entropy of the convolution of two distributions. We refer the interested reader to [28,29,30,31,32] for the connections between EPI and functional inequalities, and to [33] for the connections between EPI and SP.

1.2. Contributions

In this paper, we upper-bound the quadratic cost of entropic OT via deconvolution of one of the marginal measures and EPIs. Using this approach, we avoid any discussion related to the dynamics of the SP and instead capture the uncertainty caused by the Brownian motion quantitatively. Our contributions can be articulated as follows. We derive an HWI-type inequality for the Sinkhorn distance using a modification of Bolley's proof in [4] (see Theorem 2). We prove two new Talagrand-type inequalities (see Theorems 3 and 4). These inequalities are obtained via a numerical term C related to the saturation, or the tightness, of the EPI. We show that this term can be computed with an arbitrary deconvolution of one marginal, while the optimal deconvolution is unknown beyond the Gaussian case. Nevertheless, we simulate this term suboptimally for a variety of distributions in Figure 1.
Figure 1

Plot of the numerical term C subject to the information constraint R evaluated with respect to different distributions for the one-dimensional case.

We show that the geometry observed by the Sinkhorn distance is smoothed in the sense of measure concentration. In other words, the Sinkhorn distance implies a dimensional measure concentration inequality following Marton's method (see Corollary 2). This inequality has a simple form of normal concentration that is related to the term C and is weaker than the one implied by the Wasserstein distance. Our theoretical results are validated via numerical simulations (see Section 4). These simulations reveal several reasons for which our bounds can be either tight or loose.

Connections to Prior Art

The novelty of our work is that it naturally combines ideas from Bolley et al. in [4] and from Bai et al. in [20] to develop new entropic OT inequalities. The dimensional improvement of Bolley et al. in [4] separates an independent entropy-power term from the original Talagrand inequality. This allows us to study the entropic OT problem, which is OT with added randomness, based on the convolutional property of entropy power. On the other hand, we generalize the constructive proof of Bai et al. in [20], where they separate one Gaussian random vector into two independent Gaussian random vectors. We further show that, for any distribution, we can always find similar independent pairs satisfying several assumptions, which allows us to upper-bound the Sinkhorn distance. As a consequence of the above, our results generalize the Talagrand inequalities of Bolley et al. in ([4] Theorem 2.1) from classical OT to entropic OT and the results of Bai et al. in ([20] Theorem 2.2) from the Gaussian case to the strongly log-concave case. In particular, we show that Theorem 3 recovers ([4] Theorem 2.1) (see Corollary 1 and the discussion in Remark 6) and that Theorem 4 recovers ([20] Theorem 2.2) (see Remark 9). It should be noted that, in our analysis, we focus on the primal problem defined in [10], as opposed to the studies of its Lagrangian dual in [18,19].

1.3. Notation

$\mathbb{N}$ is the set of positive integers. $\mathbb{R}$ is the set of real numbers. $\mathbb{R}^n$ is the n-dimensional Euclidean space. Let $\mathcal{X}, \mathcal{Y}$ be two Polish spaces, i.e., separable complete metric spaces. We write an element $x \in \mathcal{X}$ in lower-case letters and a random vector X on $\mathcal{X}$ in capital letters. $\mathcal{P}(\mathcal{X})$ denotes the set of all probability measures on $\mathcal{X}$. Let $\mu$ be a Borel measure on $\mathcal{X}$. For a measurable map $T : \mathcal{X} \to \mathcal{Y}$, $T_\#\mu$ denotes the push-forward of $\mu$ to $\mathcal{Y}$, i.e., $T_\#\mu(B) = \mu(T^{-1}(B))$ for all Borel sets $B \subseteq \mathcal{Y}$. For $p \ge 1$, $L^p(\mu)$ (or simply $L^p$) denotes the Lebesgue space of p-th order for the reference measure $\mu$. $\nabla$ is the gradient operator, $\nabla\cdot$ is the divergence operator, $\Delta$ is the Laplacian operator, $\nabla^2$ is the Hessian operator, $I_n$ is the n-dimensional identity matrix, $\mathrm{Id}$ is the identity map, $\|\cdot\|$ is the Euclidean norm, $C^k$ is the set of functions that are k-times continuously differentiable, and Ric is the Ricci curvature. $h(\cdot)$, $I(\cdot;\cdot)$, $D(\cdot\|\cdot)$, $J(\cdot)$ and $J(\cdot\|\cdot)$ denote differential entropy, mutual information, relative entropy, Fisher information and relative Fisher information, respectively. All logarithms are natural logarithms. $\exists!$ denotes unique existence. $*$ is the convolution operator.

1.4. Organization of the Paper

The rest of the paper is organized as follows: in Section 2, we give the technical preliminaries of the theories and tools that we use; in Section 3, we state our main theoretical results; in Section 4, we give numerical simulations for our theorems, and in Section 5, we give the conclusions and future directions. Long proofs and background material are included in the Appendix.

2. Preliminaries

In this section, we give an overview of the theories and tools that we use.

2.1. Synopsis of Optimal Transport

We first give a brief introduction to OT theory. The OT problem was initiated by Gaspard Monge. The original formulation can be described as follows.

(Monge Problem [34]). Let $\mu \in \mathcal{P}(\mathcal{X})$, $\nu \in \mathcal{P}(\mathcal{Y})$ and let $c : \mathcal{X} \times \mathcal{Y} \to [0, \infty]$ be a cost function. Find a measurable map $T : \mathcal{X} \to \mathcal{Y}$ with $T_\#\mu = \nu$ that attains
$$\inf_{T \,:\, T_\#\mu = \nu} \int_{\mathcal{X}} c\big(x, T(x)\big)\, d\mu(x). \qquad (1)$$

Then, Kantorovich gave a probabilistic interpretation of OT. This is stated next.

(Kantorovich Problem [35]). Let X and Y be two random vectors on two Polish spaces $\mathcal{X}$ and $\mathcal{Y}$ with laws $\mu$ and $\nu$, respectively. Find a coupling that attains
$$\inf_{P_{XY} \in \Pi(\mu,\nu)} \mathbb{E}\big[c(X, Y)\big], \qquad (2)$$
where $\Pi(\mu,\nu)$ denotes the set of all joint laws of (X, Y) with marginals $\mu$ and $\nu$.

It can be further proven that (2) gives the same optimizer as (1) (see, e.g., [36]). One can define the Wasserstein distance ([13] Definition 6.1) from (2). Let $\mu, \nu \in \mathcal{P}(\mathcal{X})$ and let d be a metric on $\mathcal{X}$. Then, the Wasserstein distance of order $p \ge 1$ is defined as follows:
$$W_p(\mu, \nu) := \Big( \inf_{P_{XY} \in \Pi(\mu,\nu)} \mathbb{E}\big[d(X, Y)^p\big] \Big)^{1/p}. \qquad (3)$$
We note that the Wasserstein distance is a metric between two measures.

Cuturi in [10] gave the concept of entropic OT. In this definition, he adds an information-theoretic constraint to (2), i.e.,
$$\inf_{P_{XY} \in \Pi(\mu,\nu) \,:\, I(X;Y) \le R} \mathbb{E}\big[c(X, Y)\big], \qquad (4)$$
where $I(X;Y)$ denotes the mutual information [32] between X and Y, and $R \ge 0$. It is well known that the constraint set is convex and compact with respect to the topology of weak convergence (for details, see, e.g., ([13] Lemma 4.4), ([37] Section 1.4)). Using the lower semi-continuity of $I(X;Y)$ and ([13] Lemma 4.3), we know that the objective function f is also lower semi-continuous. Using the compactness of the constraint set and the lower semi-continuity of f, from Weierstrass' extreme value theorem, the minimum in (4) is attained. Moreover, the solution is always located on the boundary of the constraint, i.e., $I(X;Y) = R$, because the objective function of (4) is linear.

Entropic OT is an efficient way to approximate solutions of the Kantorovich problem. The Lagrangian dual of (4), which was introduced by Cuturi in [10], can be solved iteratively. The dual problem of (4) can be reformulated as follows:
$$\inf_{P_{XY} \in \Pi(\mu,\nu)} \mathbb{E}\big[c(X, Y)\big] + \epsilon\, I(X;Y), \qquad (5)$$
where $\epsilon \ge 0$ is a Lagrange multiplier. Using the Lagrange duality theorem ([38] Theorem 1, pp. 224–225), it can be shown that (4) and (5) give the same optimizer.

The uncertainty of entropic OT can be understood as follows. We can write $I(X;Y) = h(Y) - h(Y \mid X)$, where $h(Y)$ is fixed. The conditional entropy $h(Y \mid X)$ encapsulates the randomness of the conditional distribution. The randomness decreases when $I(X;Y)$ increases. Thus, unlike (1) and (2), there is no deterministic map anymore for (4) and (5), because a one-to-one mapping leads to infinite mutual information. Note that $\epsilon$ in (5) also has an explicit physical meaning. In particular, entropic OT with quadratic cost coincides with the SP with a reference measure of Brownian motion (see [15]). Then, $\epsilon$ is a diffusion coefficient of the Fokker–Planck equation associated with the Schrödinger bridge.

In our main results, we study (4) instead of (5) for two reasons. First, the mutual information in (4) gives a global description of the amount of uncertainty, while the coefficient $\epsilon$ in (5) and its associated Fokker–Planck equation are more related to local properties, as seen from the definitions of the Lagrangian dual and the Fokker–Planck equation. Further on this point, there is no explicit expression for the correspondence between R and $\epsilon$ in the duality. Second, the expectation of the cost function in (2) is comparable to the Wasserstein distance. As we demonstrate in the following, it gives a smooth version of the Wasserstein distance.

Similar to the Wasserstein distance, the Sinkhorn distance of order p is defined as follows:
$$W_{p,R}(\mu, \nu) := \Big( \inf_{P_{XY} \in \Pi(\mu,\nu) \,:\, I(X;Y) \le R} \mathbb{E}\big[d(X, Y)^p\big] \Big)^{1/p}. \qquad (6)$$
Clearly, the constraint set of (6) is a subset of $\Pi(\mu,\nu)$. Because of the minimization problem, it is easy to see that $W_{p,R}(\mu,\nu) \ge W_p(\mu,\nu)$. For this reason, we say that entropic OT is a smoothed version of classical OT. We note that the Sinkhorn distance is not a metric because it does not fulfill the axiom of identity of indiscernibles.
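To make the smoothing effect of the constraint concrete, the following short calculation is a sketch, written in the notation of (4) and (6) above and under the assumptions that X is a continuous random vector and T is an injective measurable map; it spells out why deterministic (Monge-type) couplings are excluded for finite R and why the Sinkhorn distance decreases toward the Wasserstein distance as R grows.

```latex
% Deterministic couplings are infeasible for finite R, and W_{p,R} is
% non-increasing in R (sketch; X continuous, T injective).
\begin{align*}
Y = T(X) \;\Longrightarrow\; I(X;Y) &= h(Y) - h(Y \mid X) = +\infty,
  && \text{since } h\big(T(X) \mid X\big) = -\infty, \\
R_1 \le R_2 \;\Longrightarrow\;
  \{I(X;Y) \le R_1\} &\subseteq \{I(X;Y) \le R_2\} \subseteq \Pi(\mu,\nu)
  \;\Longrightarrow\; W_{p,R_1} \ge W_{p,R_2} \ge W_p .
\end{align*}
```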
Since entropic OT is concerned with mutual information, it may be of interest to introduce a conditional Sinkhorn distance. It is defined by replacing the constraint $I(X;Y) \le R$ in (6) with a constraint on the conditional mutual information, $I(X;Y \mid U) \le R$, where U is a third random vector and the marginals of X and Y are prescribed as before. The conditional Sinkhorn distance is utilized in [20] and leads to a data processing inequality. Since the conditional mutual information is an average of the mutual informations given $U = u$, the constraint set is still convex. The objective function is also a linear form of P. Therefore, the functional and topological properties of the conditional Sinkhorn distance are similar to those of the unconditional one.

Next, we state some known results on the Talagrand inequality [1].

(Talagrand Inequality). Let $d\nu = e^{-V}\,dx$ with $\nabla^2 V \succeq K I_n$ for some $K > 0$. Then, for any $\mu \in \mathcal{P}(\mathbb{R}^n)$,
$$W_2^2(\mu, \nu) \;\le\; \frac{2}{K}\, D(\mu \,\|\, \nu). \qquad (8)$$

Remark 1. When going beyond the Euclidean space to a manifold, Otto and Villani in [40] showed that the Bakry–Emery condition also implies (8).

Recently, refined inequalities with dimensional improvements were obtained in multidimensional Euclidean space. These dimensional improvements were first observed in the Gaussian case of the logarithmic Sobolev inequality, the Brascamp–Lieb (or Poincaré) inequality [41] and the Talagrand inequality [2]. For the standard Gaussian measure $\gamma$ on $\mathbb{R}^n$ and $X \sim \mu$, the dimensional Talagrand inequality has the form
$$W_2^2(\mu, \gamma) \;\le\; \mathbb{E}\big[\|X\|^2\big] + n - 2n\sqrt{N(X)}, \qquad (9)$$
where $N(X) := \frac{1}{2\pi e} e^{\frac{2}{n} h(X)}$ is the entropy power of X. Bolley et al. in [4] generalized the results in [2,41] from the Gaussian case to strongly log-concave or log-concave measures: for $d\nu = e^{-V}\,dx$, where V is continuous and $\nabla^2 V \succeq K I_n$ with $K > 0$, their dimensional Talagrand inequality (10) ([4] Theorem 2.1) bounds the Wasserstein distance using the entropy power rather than the relative entropy. The dimensional Talagrand inequalities (9) and (10) are tighter than (8); to see this, one may refer to our Remark 6 below. Bai et al. in [20] gave a generalization of (9) to the Sinkhorn distance for the case where $\nu$ is standard Gaussian ([20] Theorem 2.2). When $R \to \infty$, this inequality coincides with (9).
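As a quick sanity check of (8), both sides can be evaluated in closed form when both measures are Gaussian. The sketch below (Python with NumPy/SciPy; the helper function names are ours and not from the paper) compares the squared 2-Wasserstein distance between two multivariate Gaussians with 2/K times their relative entropy, using the standard closed-form expressions and a standard Gaussian reference, so K = 1.

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_sq_gauss(m0, S0, m1, S1):
    """Squared 2-Wasserstein distance between N(m0, S0) and N(m1, S1)."""
    S1_half = sqrtm(S1)
    cross = sqrtm(S1_half @ S0 @ S1_half)
    return float(np.sum((m0 - m1) ** 2) + np.trace(S0 + S1 - 2 * np.real(cross)))

def kl_gauss(m0, S0, m1, S1):
    """Relative entropy D( N(m0, S0) || N(m1, S1) )."""
    n = len(m0)
    S1_inv = np.linalg.inv(S1)
    diff = m1 - m0
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - n
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

rng = np.random.default_rng(0)
n = 3
m0, m1 = rng.normal(size=n), np.zeros(n)
A = rng.normal(size=(n, n))
S0 = A @ A.T + np.eye(n)      # generic covariance for mu
S1 = np.eye(n)                # reference nu = standard Gaussian, K = 1

lhs = w2_sq_gauss(m0, S0, m1, S1)
rhs = 2.0 * kl_gauss(m0, S0, m1, S1)   # (2/K) * D with K = 1
print(f"W_2^2 = {lhs:.4f}  <=  2 D = {rhs:.4f}")
assert lhs <= rhs + 1e-9
```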

2.2. Measure Concentration

The measure concentration phenomenon describes how the probability of a random variable X changes with the deviation from a given value such as its mean or median. Marton introduced an approach to concentration directly at the level of probability measures using OT (see, e.g., ([13] Chapter 22)). To introduce the notion of concentration of measure, we first introduce the probability metric space. Let $\mathcal{X}$ be a Polish space, let d be a metric on $\mathcal{X}$, and let $\mu$ be a probability measure defined on the Borel sets of $\mathcal{X}$. Then, we say that the triple $(\mathcal{X}, d, \mu)$ is a probability metric space. For an arbitrary Borel set $A \subseteq \mathcal{X}$ and any $r > 0$, we define the enlargement $A_r$ as
$$A_r := \{ x \in \mathcal{X} : d(x, A) \le r \}, \qquad \text{where } d(x, A) := \inf_{y \in A} d(x, y).$$
Then, we say that a probability measure $\mu$ has normal (or Gaussian) concentration on $(\mathcal{X}, d)$ if there exist positive constants K and $\kappa$ such that, for every Borel set A with $\mu(A) \ge 1/2$ and every $r > 0$,
$$\mu(A_r) \ge 1 - K e^{-\kappa r^2}. \qquad (12)$$
There is another, weaker statement of normal concentration, namely that for every Borel set A with $\mu(A) > 0$ there exists $r_0 = r_0(\mu(A)) \ge 0$ such that, for all $r \ge r_0$,
$$\mu(A_r) \ge 1 - K e^{-\kappa (r - r_0)^2}. \qquad (13)$$
It is not difficult to see that (12) can be obtained from (13), possibly with degraded constants, i.e., a larger K and/or a smaller $\kappa$. The next theorem gives the connection between normal concentration and the Talagrand inequality.

(Theorem 3.4.7 [5]). Let $\mu$ satisfy a quadratic Talagrand inequality of the form (8). Then $\mu$ has a dimension-free normal concentration, with constants depending only on the constant K in (8).

The intuition behind Marton's method is that OT theory can give a metric between two probability measures via the metric structure of the supporting Polish space. The metric can be further connected with probability divergence using the Talagrand inequality.
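For concreteness, the following sketch reproduces the standard Marton argument that turns a quadratic Talagrand inequality into a normal concentration estimate of the form (13); the constants shown are the ones produced by this particular chain of inequalities and are only meant to illustrate the mechanism.

```latex
% Marton's argument: Talagrand inequality => normal concentration.
% Assume W_2^2(\cdot,\mu) \le \tfrac{2}{K} D(\cdot \,\|\, \mu), and let
% \mu_A := \mu(\cdot \mid A) for a Borel set A with \mu(A) > 0, so that
% D(\mu_A \| \mu) = \log \tfrac{1}{\mu(A)}.
\begin{align*}
d(A, B) \;\le\; W_1(\mu_A, \mu_B) \;\le\; W_2(\mu_A, \mu_B)
  &\le W_2(\mu_A, \mu) + W_2(\mu, \mu_B) \\
  &\le \sqrt{\tfrac{2}{K}\log\tfrac{1}{\mu(A)}}
     + \sqrt{\tfrac{2}{K}\log\tfrac{1}{\mu(B)}} .
\end{align*}
% Taking B = (A_r)^c, so that d(A, B) \ge r, gives, for
% r \ge r_0 := \sqrt{\tfrac{2}{K}\log\tfrac{1}{\mu(A)}},
\begin{equation*}
\mu\big((A_r)^c\big) \;\le\; \exp\!\Big(-\tfrac{K}{2}\,(r - r_0)^2\Big),
\end{equation*}
% which is a normal concentration bound of the weaker form (13).
```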

2.3. Entropy Power Inequality and Deconvolution

EPI [25] states that, for all independent continuous random vectors X and Y on $\mathbb{R}^n$,
$$N(X + Y) \;\ge\; N(X) + N(Y),$$
where $N(X) := \frac{1}{2\pi e} e^{\frac{2}{n} h(X)}$ denotes the entropy power of X. The equality is achieved when X and Y are Gaussian random vectors with proportional covariance matrices.

Deconvolution is the problem of estimating a distribution f from observations $Y_1, \dots, Y_N$ corrupted by additive noise $Z_1, \dots, Z_N$, written as
$$Y_i = X_i + Z_i, \qquad i = 1, \dots, N,$$
where the $X_i$ are i.i.d. with density f, the $Z_i$ are i.i.d. with density g, and the $X_i$ and $Z_i$ are mutually independent. The probability density function of each $Y_i$ is then given by the convolution $f * g$. Then, their entropies can be bounded by EPI directly. In our problem, we slightly abuse the concept by simply separating a random vector Y into two independent random vectors X and Z. We use this approach to introduce the uncertainty into entropic OT and consequently bound the Sinkhorn distance by EPI.

Deconvolution is generally a more challenging problem than convolution. For instance, the log-concave family is convolution stable, i.e., the convolution of two log-concave distributions is still log-concave, but we cannot guarantee that the deconvolution of two log-concave distributions is still log-concave. A trivial case is the deconvolution of a log-concave distribution by itself, which yields a Dirac function. Moreover, f may not in general be positive or integrable for arbitrary given g and h, as shown in [42]. However, it should be noted that there are many numerical methods to compute deconvolution; see, e.g., [42,43,44].
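The Gaussian case, where equality in the EPI is attained, also illustrates the splitting idea used in our bounds: a Gaussian vector $Y \sim N(0, \sigma^2 I_n)$ can always be written as $Y = X + Z$ with independent Gaussians $X \sim N(0, (\sigma^2 - t) I_n)$ and $Z \sim N(0, t I_n)$ for any $0 < t < \sigma^2$. The short Python sketch below (our own helper names, not code from the paper) evaluates the entropy powers in closed form and checks that the EPI is saturated for this isotropic split.

```python
import numpy as np

def entropy_power_gaussian(cov):
    """Entropy power N(X) = exp(2 h(X) / n) / (2 pi e) for X ~ N(0, cov)."""
    n = cov.shape[0]
    # h(X) = 0.5 * log((2 pi e)^n det(cov))  =>  N(X) = det(cov)^(1/n)
    return np.linalg.det(cov) ** (1.0 / n)

n, sigma2, t = 4, 2.0, 0.5
cov_Y = sigma2 * np.eye(n)
cov_X = (sigma2 - t) * np.eye(n)   # deconvolved component
cov_Z = t * np.eye(n)              # added "noise" component, Y = X + Z

N_Y = entropy_power_gaussian(cov_Y)
N_X = entropy_power_gaussian(cov_X)
N_Z = entropy_power_gaussian(cov_Z)

# EPI: N(X + Z) >= N(X) + N(Z); equality holds for Gaussians with
# proportional covariances, as in this isotropic split.
print(f"N(Y) = {N_Y:.4f},  N(X) + N(Z) = {N_X + N_Z:.4f}")
assert N_Y >= N_X + N_Z - 1e-12
```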

3. Main Theoretical Results

In this section, we derive our main theoretical results. First, we give a new HWI-type inequality.

Theorem 2 (HWI-Type Inequality). Let ..., where the relative Fisher information ... .

Proof. See Appendix A. □

Remark (On Theorem 2). In Theorem 2, we construct ... .

The next result gives a new Talagrand-type inequality.

Theorem 3 (Talagrand-Type Inequality). Let ..., where ... .

Proof. Let ... in (15). In such a case, we have ... from the definition of relative Fisher information. Take ...; then, (16) is proven from (15). □

Next, we state some technical remarks on Theorem 3.

Remark (On Theorem 3). In Theorem 3, we show that the Sinkhorn distance of two random vectors can be upper-bounded by a difference of a functional on the two marginals.

Remark (On the numerical term C). The numerical term C ... . Therefore, ... . Moreover, we can show that there always exists such a sequence C non-decreasing with respect to R. We know that ... . We note that, for particular distributions, we may have an explicit expression of C. As a result, we have ... . Note that the linear combination subject to ..., where ... . This means that the saturation of EPI is controlled by ... . In Figure 1, we plot the numerical term C subject to the information constraint R, evaluated with respect to different distributions for the one-dimensional case.

Remark (On the condition of identity of Theorem 3). To show the condition of identity of (16), ... .

The following corollary is immediate from Theorem 3.

Corollary 1. The Wasserstein distance is bounded by ... .

Proof. This is immediate from Theorem 3 when ... . In this case, ... . □

Remark 6 (On Corollary 1). We note that ... .

We notice that C is the only difference between (10) and (16), from Remark 6. Therefore, we can immediately obtain a result related to measure concentration following ([4] Corollary 2.4). Next, we state the result on measure concentration obtained from (16).

Corollary 2. Let ... .

Proof. See Appendix B. □

Next, we state some technical comments on Corollary 2.

Remark (On Corollary 2). We note that, in the derivation of Corollary 2, we follow the method of Marton in [...].

The next theorem is another Talagrand-type inequality. Compared to Theorem 3, the following result is a bound obtained using a term related to the saturation of ..., instead of the saturation of ... that was used in Theorem 3.

Theorem 4. Let ..., where ... .

Proof. See Appendix D. □

We offer the following technical comments on Theorem 4.

Remark (On Theorem 4). Similar to Theorem 3, Theorem 4 can also give a measure concentration inequality, namely ..., where ... . When ..., this is exactly the same as ... .

The next theorem gives a Talagrand-type bound for the conditional Sinkhorn distance.

Theorem (Talagrand-type bound for conditional Sinkhorn distance). Let ..., where ... .

Proof. See Appendix E. □

4. Numerical Simulations

In this section, we describe several numerical simulations to illustrate the validity of our theoretical findings. To check the tightness of our bounds, we use as a reference the numerical solution obtained via the Sinkhorn algorithm, which is available in the POT library [47]. As an iterative method, the Sinkhorn algorithm carries a computational error, since the iteration stops once it has converged to within a prescribed tolerance. For example, in Figure 2a, we plot the result for Theorem 3 with two Gaussian marginals, which is the scenario in which the identity of (16) holds. From the figure, we can see that the simulated value is slightly greater than the bound. Nevertheless, we note that the error is reasonably small.
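To reproduce the kind of reference values used here, a minimal sketch with the POT library is shown below. It assumes samples from two Gaussian marginals of our own choosing; note that POT's entropic solver is parameterized by a regularization strength eps, which plays the role of the multiplier ε in the dual form (5) rather than of R directly, so eps is swept and the resulting transport cost is compared with the unregularized one.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
n_samples, dim = 300, 2

# Empirical samples from two Gaussian marginals (an illustrative choice).
xs = rng.normal(loc=0.0, scale=1.0, size=(n_samples, dim))
xt = rng.normal(loc=1.0, scale=1.5, size=(n_samples, dim))

a, b = ot.unif(n_samples), ot.unif(n_samples)   # uniform sample weights
M = ot.dist(xs, xt)                             # squared Euclidean costs
M /= M.max()                                    # rescale for numerical stability

cost_exact = ot.emd2(a, b, M)                   # unregularized OT cost
print(f"unregularized transport cost: {cost_exact:.4f}")

# Entropic OT: a larger eps corresponds to a smaller effective R, and the
# transport cost of the regularized plan exceeds the unregularized cost,
# approaching it as eps -> 0 (R -> infinity).
for eps in [0.5, 0.05, 0.005]:
    G = ot.sinkhorn(a, b, M, reg=eps)           # entropic-regularized coupling
    cost = float(np.sum(G * M))                 # E[d(X,Y)^2] under that coupling
    print(f"eps = {eps:6.3f}:  regularized transport cost = {cost:.4f}")
```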
Figure 2

Numerical simulations and bounds via (16) for different R (panels (a)-(c) correspond to three different pairs of marginal distributions).

The simulations for Theorems 3 and 4 are given in Figure 2 and Figure 3, respectively. Since the optimal value of C in Theorem 3 and the error term in Theorem 4 beyond the linear case are unknown, we mainly simulate the case with one Gaussian marginal. In this way, we avoid the unknown factors and deduce several observations related to the tightness of the bounds derived in these two theorems.
Figure 3

Numerical simulations and bounds via (21) for different R (panels (a)-(f) correspond to six different pairs of marginal distributions; panel (e) uses a Gamma marginal).

The first observation is about absolute continuity. We observe that the original Talagrand inequality (8) is not tight when $\mu$ is not absolutely continuous with respect to $\nu$, because $D(\mu\|\nu) = \infty$ in this case. In Figure 4, we illustrate one such case with a near-violation of absolute continuity between two strongly log-concave distributions, i.e., the Radon–Nikodym derivative $d\mu/d\nu$, up to a normalizing factor, goes to $\infty$ in the tail. Consequently, the bound (16) from Theorem 3 is loose, as illustrated in Figure 5. The bound becomes much looser if we increase the discontinuity, as shown in Figure 6. By simply swapping the roles of the two distributions, we preserve absolute continuity and the bound becomes tight, as we can see in Figure 5b and Figure 6b.
Figure 4

Probability densities of the two marginal distributions.

Figure 5

Numerical simulations and bounds for different R. (a) Bound via (16). (b) Bound via (21).

Figure 6

Numerical simulations and bounds for different R. (a) Bound via (16). (b) Bound via (21).

The second observation is related to the numerical term C. By comparing Figure 2b and Figure 3a, we observe that one of the two bounds gives a better description than the other, i.e., it is tighter. This is reasonable according to our previous discussion, i.e., the independent linear combination of Cauchy random variables is not the optimal deconvolution. Actually, even if the marginal in (16) is not Gaussian, this seems to be true for all the simulated distributions. Furthermore, we observe that the tightness of the bounds in Theorems 3 and 4 is related to the linearity of the transport map, which can be seen as a measure of similarity between the two marginals. For example, the Cauchy and Laplace distributions are similar to the Gaussian distribution; thus, they show a tight bound in Figure 3a,d. On the other hand, the Gaussian mixture and the exponential distribution are relatively far from the Gaussian distribution; hence, Figure 3c,f give looser bounds. In Figure 7, we plot the dimensionality of the Sinkhorn distance between isotropic Gaussians. Each curve corresponds to a pair of Gaussian distributions in a different dimension, and all pairs have the same Wasserstein distance. It can be seen that the information constraint causes more smoothing in higher dimensions, which is consistent with Corollary 2.
Figure 7

Sinkhorn distances between isotropic Gaussians in different dimensions.


5. Conclusions and Future Directions

In this paper, we considered a generalization of OT with an entropic constraint. We showed that the constraint introduces uncertainty and that this uncertainty can be captured by EPI. We first derived an HWI-type inequality for the Sinkhorn distance. Then, we derived two Talagrand-type inequalities. Because of the strong geometric implication of the Talagrand inequality, each of these two Talagrand-type inequalities also gives a (weaker) measure concentration inequality. From this result, we claimed that the geometry implied by the Sinkhorn distance is smoothed by the entropic constraint. We also showed that our results can be generalized to a conditional version of the entropic OT inequality. However, two factors in the inequalities we derived remain unknown, i.e., the optimal value of the term C in Theorem 3 and the error term in Theorem 4 when one goes beyond the linear case. Although we showed that a suboptimal C can be computed using an arbitrary linear combination of two independent random vectors, determining the optimal value remains an intriguing open question. We believe that the improvement of the term C may be related to the Fisher information. Without the assumption of strong log-concavity, an extra relative Fisher information term is required to upper-bound the Wasserstein distance in Theorem 2. The reverse EPI in [31] also involves the Fisher information. If we consider how the Fisher information changes along the Schrödinger bridge, a better estimate of the term C may be feasible.
