
Statistical Divergences between Densities of Truncated Exponential Families with Nested Supports: Duo Bregman and Duo Jensen Divergences.

Frank Nielsen

Abstract

By calculating the Kullback-Leibler divergence between two probability measures belonging to different exponential families dominated by the same measure, we obtain a formula that generalizes the ordinary Fenchel-Young divergence. Inspired by this formula, we define the duo Fenchel-Young divergence and report a majorization condition on its pair of strictly convex generators, which guarantees that this divergence is always non-negative. The duo Fenchel-Young divergence is also equivalent to a duo Bregman divergence. We show how to use these duo divergences by calculating the Kullback-Leibler divergence between densities of truncated exponential families with nested supports, and report a formula for the Kullback-Leibler divergence between truncated normal distributions. Finally, we prove that the skewed Bhattacharyya distances between truncated exponential families amount to equivalent skewed duo Jensen divergences.


Keywords:  exponential family; statistical divergence; truncated exponential family; truncated normal distributions

Year:  2022        PMID: 35327931      PMCID: PMC8947456          DOI: 10.3390/e24030421

Source DB:  PubMed          Journal:  Entropy (Basel)        ISSN: 1099-4300            Impact factor:   2.524


1. Introduction

1.1. Exponential Families

Let (X, A) be a measurable space, and consider a regular minimal exponential family [1] E = {P_θ : θ ∈ Θ} of probability measures all dominated by a base measure μ (P_θ ≪ μ). The Radon–Nikodym derivatives or densities of the probability measures with respect to μ can be written canonically as

p_θ(x) := dP_θ/dμ(x) = exp(θ·t(x) − F(θ) + k(x)),

where θ denotes the natural parameter, t(x) the sufficient statistic [1,2,3,4], and F(θ) the log-normalizer [1] (or cumulant function). The optional auxiliary term k(x) allows us to change the base measure μ into the measure ν such that dν = e^{k(x)} dμ. The order D of the family is the dimension of the natural parameter space Θ ⊆ ℝ^D, where ℝ denotes the set of reals. The sufficient statistic t(x) = (t_1(x), …, t_D(x)) is a vector of D functions. The sufficient statistic is said to be minimal when the D + 1 functions 1, t_1(x), …, t_D(x) are linearly independent [1]. The sufficient statistic is such that the conditional probability Pr(x | t(x), θ) = Pr(x | t(x)) does not depend on θ; that is, all information necessary for the statistical inference of the parameter θ is contained in t(x). Exponential families are characterized as the families of parametric distributions with finite-dimensional sufficient statistics [1]. Exponential families include, among others, the exponential, normal, gamma/beta, inverse gamma, inverse Gaussian, and Wishart distributions once a reparameterization of the parametric distributions is performed to reveal their natural parameters [1]. When the sufficient statistic is t(x) = x, these exponential families are called natural exponential families or tilted exponential families [5] in the literature. Indeed, the distributions of the exponential family can be interpreted as distributions obtained by tilting the base measure [6]. In this paper, we consider either discrete exponential families like the family of Poisson distributions (univariate distributions of order D = 1 with respect to the counting measure) or continuous exponential families like the family of normal distributions (univariate distributions of order D = 2 with respect to the Lebesgue measure). The Radon–Nikodym derivative of a discrete exponential family is a probability mass function (pmf), and the Radon–Nikodym derivative of a continuous exponential family is a probability density function (pdf). The support of a pmf is a subset of ℤ (where ℤ denotes the set of integers), and the support of a d-variate pdf is a subset of ℝ^d. The Poisson distributions have support ℕ ∪ {0} = {0, 1, 2, …}, where ℕ denotes the set of natural numbers. Densities of an exponential family all have coinciding support [1].
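To make the canonical decomposition concrete, here is a minimal sketch (ours, not from the paper) that evaluates the Poisson pmf through its canonical form θ = log λ, t(x) = x, F(θ) = exp(θ), k(x) = −log x! (the same decomposition as in Table 1):

```python
# Poisson pmf evaluated via the canonical exponential-family decomposition
# p_theta(x) = exp(theta * t(x) - F(theta) + k(x)).
import math

def poisson_pmf_canonical(x: int, lam: float) -> float:
    theta = math.log(lam)        # natural parameter theta = log(lambda)
    t = x                        # sufficient statistic t(x) = x
    F = math.exp(theta)          # log-normalizer (cumulant function) F(theta) = e^theta
    k = -math.lgamma(x + 1)      # auxiliary term k(x) = -log(x!)
    return math.exp(theta * t - F + k)

# Agrees with the textbook pmf lambda^x exp(-lambda) / x!:
x, lam = 3, 2.5
print(poisson_pmf_canonical(x, lam), lam**x * math.exp(-lam) / math.factorial(x))
```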

1.2. Truncated Exponential Families with Nested Supports

In this paper, we shall consider truncated exponential families [7] with nested supports. A truncated exponential family is a set of parametric probability distributions obtained by truncating the support of an exponential family. Truncated exponential families are themselves exponential families, but their statistical inference is more subtle [8,9]. Let Ẽ = {P̃_θ : θ ∈ Θ} be a truncated exponential family of E with truncated support X̃. The canonical decompositions of the densities p_θ of E and p̃_θ of Ẽ have the following expressions:

p_θ(x) = exp(θ·t(x) − F(θ) + k(x)), x ∈ X,
p̃_θ(x) = exp(θ·t(x) − F̃(θ) + k(x)), x ∈ X̃,

where the log-normalizer of the truncated exponential family is:

F̃(θ) = log Z̃(θ) = F(θ) + log P_θ(X̃),

where Z̃(θ) = ∫_{X̃} exp(θ·t(x) + k(x)) dμ(x) is a normalizing term that takes into account the truncated support X̃. These equations show that densities of truncated exponential families only differ from the untruncated densities by their log-normalizer functions. Let X denote the support of the distributions of E and X̃ the support of Ẽ. Family Ẽ is a truncated exponential family of E that can be notationally written as Ẽ = E|_X̃. Family E can in turn be interpreted as the (un)truncated exponential family E = Ẽ|_X. A truncated exponential family Ẽ of E is said to have nested support when X̃ ⊂ X. For example, the family of half-normal distributions defined on the support [0, ∞) is a nested truncated exponential family of the family of normal distributions defined on the support ℝ.
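As a quick numerical illustration of the relation F̃(θ) = F(θ) + log P_θ(X̃) (a sketch of ours assuming SciPy is available, not from the paper), the half-normal family with t(x) = x² and θ = −1/(2σ²) satisfies F̃(θ) = F(θ) + log(1/2), since P_θ([0, ∞)) = 1/2 by symmetry:

```python
# Check that the truncated log-normalizer equals F(theta) + log P_theta(support).
import numpy as np
from scipy.integrate import quad

def F_normal(theta):        # log of Z(theta) = \int_R exp(theta x^2) dx = sqrt(pi / -theta)
    return 0.5 * np.log(np.pi / -theta)

def F_half_normal(theta):   # log of Ztilde(theta) = \int_0^inf exp(theta x^2) dx
    val, _ = quad(lambda x: np.exp(theta * x**2), 0, np.inf)
    return np.log(val)

theta = -0.5                # corresponds to sigma = 1
print(F_half_normal(theta), F_normal(theta) + np.log(0.5))  # both ~ 0.2258
```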

1.3. Kullback–Leibler Divergence between Exponential Family Distributions

For two σ-finite probability measures P and Q on (X, A) such that P is dominated by Q (P ≪ Q), the Kullback–Leibler divergence between P and Q is defined by

D_KL[P : Q] := E_P[log (dP/dQ)],

where dP/dQ denotes the Radon–Nikodym derivative of P with respect to Q and E_P[·] denotes the expectation of a random variable with respect to P [10]. When P is not dominated by Q, we set D_KL[P : Q] = +∞. Gibbs' inequality [11] shows that the Kullback–Leibler divergence (KLD for short) is always non-negative. The proof of Gibbs' inequality relies on Jensen's inequality and holds for the wide class of f-divergences [12] induced by convex generators f with f(1) = 0:

I_f[P : Q] := E_Q[f(dP/dQ)] ≥ f(E_Q[dP/dQ]) = f(1) = 0.

The KLD is the f-divergence obtained for the convex generator f(u) = u log u.
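The f-divergence representation can be checked numerically. The following sketch (ours, assuming SciPy) verifies that the generator f(u) = u log u recovers the KLD between two unit-variance normal densities, for which D_KL = (μ1 − μ2)²/2:

```python
# The KLD as the f-divergence generated by f(u) = u log u.
import numpy as np
from scipy.integrate import quad

p = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)           # N(0, 1) density
q = lambda x: np.exp(-(x - 1)**2 / 2) / np.sqrt(2 * np.pi)     # N(1, 1) density

f = lambda u: u * np.log(u)                                    # convex, f(1) = 0
I_f, _ = quad(lambda x: q(x) * f(p(x) / q(x)), -10, 10)        # f-divergence form
kl, _ = quad(lambda x: p(x) * np.log(p(x) / q(x)), -10, 10)    # direct KLD
print(I_f, kl)   # both ~ 0.5 = (mu1 - mu2)^2 / 2
```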

1.4. Kullback–Leibler Divergence between Exponential Family Densities

It is well-known that the KLD between two distributions p_θ1 and p_θ2 of E amounts to computing an equivalent Fenchel–Young divergence [13]:

D_KL[p_θ1 : p_θ2] = F(θ2) + F*(η1) − θ2·η1 = Y_{F,F*}(θ2, η1),

where η = ∇F(θ) is the moment parameter [1] and ∇F is the gradient of F with respect to θ. The Fenchel–Young divergence is defined for a pair of strictly convex conjugate functions F and F* [14] related by the Legendre–Fenchel transform F*(η) = sup_{θ ∈ Θ} {θ·η − F(θ)} by

Y_{F,F*}(θ, η) := F(θ) + F*(η) − θ·η.

Amari (1985) first introduced this formula as the canonical divergence of dually flat spaces in information geometry [15] (Equation 3.21), and proved that the Fenchel–Young divergence is obtained as the KLD between densities belonging to the same exponential family [15] (Theorem 3.7). Azoury and Warmuth (2001) expressed the KLD using dual Bregman divergences [13]:

D_KL[p_θ1 : p_θ2] = B_F(θ2 : θ1) = B_{F*}(η1 : η2),

where a Bregman divergence [16] is defined for a strictly convex and differentiable generator F by:

B_F(θ1 : θ2) := F(θ1) − F(θ2) − (θ1 − θ2)·∇F(θ2).

Acharyya termed the divergence Y_{F,F*} the Fenchel–Young divergence in his PhD thesis [17] (2013), and Blondel et al. (2020) called such divergences Fenchel–Young losses in the context of machine learning [18] (Equation (9) in Definition 2). This divergence was also called the Legendre–Fenchel divergence by the author in [19]. The Fenchel–Young divergence stems from the Fenchel–Young inequality [14,20]:

F(θ) + F*(η) ≥ θ·η,

with equality if and only if η = ∇F(θ). Figure 1 visualizes the 1D Fenchel–Young divergence and gives a geometric proof that Y_{F,F*}(θ, η) ≥ 0 with equality if and only if η = ∇F(θ). Indeed, by considering the behavior of the Legendre–Fenchel transformation under translations, we may assume without loss of generality that F(0) = 0 and ∇F(0) = 0. The derivative f = F′ is strictly increasing and continuous since F is a strictly convex and differentiable function. Thus we have F(θ) = ∫_0^θ f(t) dt and F*(η) = ∫_0^η f^{−1}(u) du, and the rectangle of area θη is covered by the two areas F(θ) and F*(η).
Figure 1

Visualizing the Fenchel–Young divergence.

The Bregman divergence B_F(θ1 : θ2) amounts to a dual Bregman divergence [13] between the dual parameters with swapped argument order:

B_F(θ1 : θ2) = B_{F*}(η2 : η1),

where η_i = ∇F(θ_i) for i ∈ {1, 2}. Thus the KLD between two distributions p_θ1 and p_θ2 of E can be expressed equivalently as follows:

D_KL[p_θ1 : p_θ2] = B_F(θ2 : θ1) = B_{F*}(η1 : η2) = Y_{F,F*}(θ2, η1) = Y_{F*,F}(η1, θ2).

The symmetrized Kullback–Leibler divergence between two distributions p_θ1 and p_θ2 of E is called Jeffreys' divergence [21] and amounts to a symmetrized Bregman divergence [22]:

D_J[p_θ1, p_θ2] = D_KL[p_θ1 : p_θ2] + D_KL[p_θ2 : p_θ1] = (θ2 − θ1)·(η2 − η1).

Note that the Bregman divergence can also be interpreted as a surface area. Figure 2 illustrates the sided and symmetrized Bregman divergences; a numerical check of these identities follows Figure 2.
Figure 2

Visualizing the sided and symmetrized Bregman divergences.
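These identities are easy to verify numerically. The following sketch (ours, not the paper's code) checks D_KL = B_F(θ2 : θ1) = B_{F*}(η1 : η2) for the Poisson family, where F(θ) = e^θ, F*(η) = η log η − η, θ = log λ, and η = λ:

```python
# KLD between two Poisson distributions as dual Bregman divergences.
import math

def bregman(F, gradF, x1, x2):
    """Bregman divergence B_F(x1 : x2) = F(x1) - F(x2) - (x1 - x2) F'(x2)."""
    return F(x1) - F(x2) - (x1 - x2) * gradF(x2)

F, gradF = math.exp, math.exp                        # log-normalizer and its derivative
Fstar = lambda e: e * math.log(e) - e                # convex conjugate (negentropy)
gradFstar = math.log                                 # (F*)' = log = (F')^{-1}

l1, l2 = 2.0, 5.0
kld = l1 * math.log(l1 / l2) + l2 - l1               # closed-form Poisson KLD
print(kld,
      bregman(F, gradF, math.log(l2), math.log(l1)), # B_F(theta2 : theta1)
      bregman(Fstar, gradFstar, l1, l2))             # B_{F*}(eta1 : eta2)
```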

1.5. Contributions and Paper Outline

We recall in Section 2 the formula obtained for the Kullback–Leibler divergence between densities of two different exponential families [23] (Equation (29)). Inspired by this formula, we give a definition of the duo Fenchel–Young divergence induced by a pair of strictly convex functions F1 and F2 (Definition 1) in Section 3, and prove that the divergence is always non-negative provided that F1 upper-bounds F2. We then define the duo Bregman divergence (Definition 2) corresponding to the duo Fenchel–Young divergence. In Section 4, we show that the Kullback–Leibler divergence between a truncated density and a density of the same parametric exponential family amounts to a duo Fenchel–Young divergence, or equivalently to a duo Bregman divergence on swapped parameters (Theorem 1). That is, we consider a truncated exponential family E1 of an exponential family E2 such that the common support X1 of the distributions of E1 is contained in the common support X2 of the distributions of E2 and both canonical decompositions of the families coincide (see Equation (2)). In particular, when E2 is also a truncated exponential family, we express the KLD between two truncated distributions as a duo Bregman divergence. As examples, we report the formula for the Kullback–Leibler divergence between two densities of truncated exponential families (Corollary 1), and illustrate the formula for the Kullback–Leibler divergence between truncated exponential distributions (Example 6) and for the Kullback–Leibler divergence between truncated normal distributions (Example 7). In Section 5, we further consider the skewed Bhattacharyya distance between densities of truncated exponential families and prove that it amounts to a duo Jensen divergence (Theorem 2). Finally, we conclude in Section 6.

2. Kullback–Leibler Divergence between Different Exponential Families

Consider now two exponential families [1] P and Q defined by their Radon–Nikodym derivatives with respect to two positive measures μ_P and μ_Q on (X, A):

P = {p_θ(x) = exp(θ·t_P(x) − F_P(θ) + k_P(x)) : θ ∈ Θ},
Q = {q_θ′(x) = exp(θ′·t_Q(x) − F_Q(θ′) + k_Q(x)) : θ′ ∈ Θ′}.

The corresponding natural parameter spaces are Θ and Θ′. The order of P is D, t_P(x) denotes the sufficient statistics of P, and k_P(x) is a term to adjust/tilt the base measure μ_P. Similarly, the order of Q is D′, t_Q(x) denotes the sufficient statistics of Q, and k_Q(x) is an optional term to adjust the base measure μ_Q. The functions F_P and F_Q denote the corresponding log-normalizers of P and Q, respectively; they are strictly convex and real analytic [1]. Hence, those functions are infinitely many times differentiable on their open natural parameter spaces.

Consider the KLD between p_θ and q_θ′, where both densities are expressed with respect to a common dominating measure (using the auxiliary terms k_P and k_Q) and the support of p_θ is contained in the support of q_θ′ (and hence P_θ ≪ Q_θ′). Then the KLD between p_θ and q_θ′, first considered in [23], is:

D_KL[p_θ : q_θ′] = F_Q(θ′) − F_P(θ) + θ·E_{p_θ}[t_P(x)] − θ′·E_{p_θ}[t_Q(x)] + E_{p_θ}[k_P(x) − k_Q(x)].

Recall that the dual parameterization of an exponential family density p_θ is η = ∇F_P(θ) = E_{p_θ}[t_P(x)] [1], and that the Fenchel–Young equality F_P*(η) = θ·η − F_P(θ) holds for η = ∇F_P(θ). Thus the KLD between p_θ and q_θ′ can be rewritten as

D_KL[p_θ : q_θ′] = F_Q(θ′) + F_P*(η) − θ′·E_{p_θ}[t_Q(x)] + E_{p_θ}[k_P(x) − k_Q(x)].   (29)

This formula was reported in [23] and generalizes the Fenchel–Young divergence [17] obtained when P = Q (with t_P = t_Q, k_P = k_Q, F_P = F_Q, and μ_P = μ_Q). The formula of Equation (29) was illustrated in [23] with two examples: the KLD between Laplacian distributions and zero-centered Gaussian distributions, and the KLD between two Weibull distributions. Both these examples use the Lebesgue base measure for μ_P and μ_Q. Let us report another example that uses the counting measure as the base measure for μ_P and μ_Q: the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decompositions of the Poisson and geometric pmfs are summarized in Table 1. Since t_P(x) = t_Q(x) = x, η = E_{p_λ}[x] = λ, θ′(p) = log(1 − p), F_Q(θ′(p)) = −log p, F_P*(η(λ)) = λ log λ − λ, k_P(x) = −log x!, and k_Q(x) = 0, we obtain:

D_KL[p_λ : q_p] = −log p + λ log λ − λ − λ log(1 − p) − E_{p_λ}[log x!].

Note that we can also calculate the KLD between two geometric distributions q_p1 and q_p2 in closed form using the ordinary Fenchel–Young divergence. We obtain:

D_KL[q_p1 : q_p2] = log(p1/p2) + ((1/p1) − 1) log((1 − p1)/(1 − p2)).
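The closed-form expression above can be validated against a direct series computation; here is a small sketch (ours, with the expectation E_λ[log x!] evaluated by a truncated sum):

```python
# KLD between a Poisson pmf and a geometric pmf: Equation (29) vs direct summation.
import math

def kld_poisson_geometric(lam: float, p: float, terms: int = 200) -> float:
    E_log_fact = sum(math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
                     * math.lgamma(x + 1) for x in range(terms))
    return (-math.log(p) + lam * math.log(lam) - lam
            - lam * math.log(1 - p) - E_log_fact)

def kld_direct(lam: float, p: float, terms: int = 200) -> float:
    total = 0.0
    for x in range(terms):
        px = math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))  # Poisson pmf
        qx = (1 - p) ** x * p                                        # geometric pmf
        total += px * math.log(px / qx)
    return total

print(kld_poisson_geometric(3.0, 0.4), kld_direct(3.0, 0.4))  # both ~ same value
```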

3. The Duo Fenchel–Young Divergence and Its Corresponding Duo Bregman Divergence

Inspired by the formula of Equation (29), we shall define the duo Fenchel–Young divergence using a dominance condition on a pair of strictly convex generators.

Definition 1 (duo Fenchel–Young divergence). Let F1 and F2 be two strictly convex and differentiable functions such that F1(θ) ≥ F2(θ) for all θ. The duo Fenchel–Young divergence is defined by

Y_{F1,F2*}(θ1, η2) := F1(θ1) + F2*(η2) − θ1·η2.   (35)

When F1 = F2 = F, we have F2* = F*, and we retrieve the ordinary Fenchel–Young divergence [17]:

Y_{F,F*}(θ1, η2) = F(θ1) + F*(η2) − θ1·η2.

Note that in Equation (35), η2 denotes a dual parameter with respect to F2, i.e., η2 = ∇F2(θ2).

Proposition 1 (Non-negative duo Fenchel–Young divergence). The duo Fenchel–Young divergence is always non-negative: Y_{F1,F2*}(θ1, η2) ≥ 0.

The proof relies on the following reverse dominance property of strictly convex and differentiable conjugate functions:

Lemma 1 (Reverse majorization order of functions by the Legendre–Fenchel transform). Let F1 and F2 be such that F1(θ) ≥ F2(θ) for all θ. Then their convex conjugates satisfy F1*(η) ≤ F2*(η) for all η.

This property is graphically illustrated in Figure 3. The reverse dominance property of the Legendre–Fenchel transformation can be checked algebraically as follows:

F1*(η) = sup_θ {θ·η − F1(θ)} ≤ sup_θ {θ·η − F2(θ)} = F2*(η).
Figure 3

(a) Visual illustration of the Legendre–Fenchel transformation: F*(η) is measured as the vertical gap (left long black line with both arrows) between the origin and the hyperplane of “slope” η tangent to the graph of F, evaluated at 0. (b) The Legendre transforms F1* and F2* of two functions F1 and F2 such that F1 ≥ F2 reverse the dominance order: F1* ≤ F2*.

Thus we have F1* ≤ F2* when F1 ≥ F2. Therefore it follows that Y_{F1,F2*}(θ1, η2) ≥ 0 since we have

Y_{F1,F2*}(θ1, η2) = F1(θ1) + F2*(η2) − θ1·η2 ≥ F1(θ1) + F1*(η2) − θ1·η2 = Y_{F1,F1*}(θ1, η2),

where Y_{F1,F1*} is the ordinary Fenchel–Young divergence, which is guaranteed to be non-negative by the Fenchel–Young inequality. □

We can express the duo Fenchel–Young divergence using the primal coordinate systems as a generalization of the Bregman divergence to two generators, which we term the duo Bregman divergence (see Figure 4):

Y_{F1,F2*}(θ1, η2) = F1(θ1) − F2(θ2) − (θ1 − θ2)·∇F2(θ2) =: B_{F1,F2}(θ1 : θ2),

with η2 = ∇F2(θ2).
Figure 4

The duo Bregman divergence B_{F1,F2}(θ1 : θ2) induced by two strictly convex and differentiable functions F1 and F2 such that F1 ≥ F2. We check graphically that B_{F1,F2}(θ1 : θ2) ≥ B_{F2}(θ1 : θ2) (vertical gaps).

This generalized Bregman divergence is non-negative when F1 ≥ F2. Indeed, we check that

B_{F1,F2}(θ1 : θ2) = B_{F2}(θ1 : θ2) + F1(θ1) − F2(θ1) ≥ 0,

since both the ordinary Bregman divergence B_{F2}(θ1 : θ2) and the gap F1(θ1) − F2(θ1) are non-negative.

Definition 2 (duo Bregman divergence). Let F1 and F2 be two strictly convex and differentiable functions such that F1(θ) ≥ F2(θ) for all θ. The duo Bregman divergence is defined by

B_{F1,F2}(θ1 : θ2) := F1(θ1) − F2(θ2) − (θ1 − θ2)·∇F2(θ2).

When F1 = F2 = F, we recover the ordinary Bregman divergence B_F. We now state a property between dual duo Bregman divergences:

Proposition 2 (Dual duo Fenchel–Young and Bregman divergences). We have

Y_{F1,F2*}(θ1, η2) = B_{F1,F2}(θ1 : θ2) = B_{F2*,F1*}(η2 : η1).

From the Fenchel–Young equalities of the conjugate pairs, we have F_i*(η_i) = θ_i·η_i − F_i(θ_i) for η_i = ∇F_i(θ_i) and i ∈ {1, 2}, with ∇F_i*(η_i) = θ_i. Thus we have

B_{F2*,F1*}(η2 : η1) = F2*(η2) − F1*(η1) − (η2 − η1)·θ1 = F1(θ1) + F2*(η2) − θ1·η2 = Y_{F1,F2*}(θ1, η2).

Recall that F1 ≥ F2 implies F1* ≤ F2* (Lemma 1), i.e., F2* ≥ F1*, and therefore the dual duo Bregman divergence is non-negative: B_{F2*,F1*}(η2 : η1) ≥ 0.
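The identity B_{F1,F2}(θ1 : θ2) = B_{F2}(θ1 : θ2) + F1(θ1) − F2(θ1) and the non-negativity claim can be checked numerically; here is a minimal sketch with toy generators of our choosing, F1(θ) = θ² + 1 and F2(θ) = θ² (both strictly convex with F1 ≥ F2):

```python
# Duo Bregman divergence: identity and non-negativity check on random pairs.
import numpy as np

F1 = lambda t: t**2 + 1.0
F2 = lambda t: t**2
dF2 = lambda t: 2 * t

def duo_bregman(t1, t2):
    return F1(t1) - F2(t2) - (t1 - t2) * dF2(t2)

rng = np.random.default_rng(0)
for t1, t2 in rng.normal(size=(5, 2)):
    b2 = F2(t1) - F2(t2) - (t1 - t2) * dF2(t2)   # ordinary Bregman divergence B_{F2}
    assert abs(duo_bregman(t1, t2) - (b2 + F1(t1) - F2(t1))) < 1e-12
    assert duo_bregman(t1, t2) >= 0
print("duo Bregman identity and non-negativity verified on random pairs")
```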

4. Kullback–Leibler Divergence between Distributions of Truncated Exponential Families

Let E1 be an exponential family of distributions all dominated by μ with Radon–Nikodym densities p_θ defined on the support X1. Let E2 be another exponential family of distributions all dominated by μ with Radon–Nikodym densities q_θ defined on the support X2 such that X1 ⊆ X2. Let q̃(x; θ) = exp(θ·t(x) + k(x)) be the common unnormalized density so that p_θ(x) = q̃(x; θ)/Z1(θ) on X1 and q_θ(x) = q̃(x; θ)/Z2(θ) on X2, with F1(θ) = log Z1(θ) and F2(θ) = log Z2(θ) being the log-normalizer functions of E1 and E2, respectively. Since X1 ⊆ X2 and q̃(x; θ) > 0, we have:

Z1(θ) = ∫_{X1} q̃(x; θ) dμ(x) ≤ ∫_{X2} q̃(x; θ) dμ(x) = Z2(θ).

Therefore F1(θ) ≤ F2(θ), and the common natural parameter space is Θ2 ⊆ Θ1. Notice that the reverse Kullback–Leibler divergence D_KL[q_θ2 : p_θ1] = +∞ since q_θ2 is not dominated by p_θ1 (the support X2 is not contained in X1).

Theorem 1 (Kullback–Leibler divergence between truncated exponential family densities). Let p_θ1 ∈ E1 and q_θ2 ∈ E2 with X1 ⊆ X2 and matching canonical decompositions. Then

D_KL[p_θ1 : q_θ2] = F2(θ2) − F1(θ1) − (θ2 − θ1)·∇F1(θ1) = B_{F2,F1}(θ2 : θ1) = Y_{F2,F1*}(θ2, η1),

with η1 = ∇F1(θ1) = E_{p_θ1}[t(x)]. Since F2 ≥ F1, this duo Bregman divergence is non-negative.

For example, consider the calculation of the KLD between an exponential distribution (viewed as half a Laplacian distribution, i.e., a Laplacian distribution truncated to the positive real support) and a Laplacian distribution defined on the real line. Let p_λ1(x) = λ1 e^{−λ1 x} on X1 = [0, ∞) and q_λ2(x) = (λ2/2) e^{−λ2 |x|} on X2 = ℝ, with common sufficient statistic t(x) = −|x| and natural parameter λ > 0. We have F1(λ) = −log λ and F2(λ) = log 2 − log λ, with ∇F1(λ) = −1/λ. Applying Theorem 1, we get

D_KL[p_λ1 : q_λ2] = log 2 + log(λ1/λ2) + (λ2/λ1) − 1.

Moreover, we can interpret that divergence using the Itakura–Saito divergence D_IS(a : b) := (a/b) − log(a/b) − 1: we have D_KL[p_λ1 : q_λ2] = log 2 + D_IS(λ2 : λ1). We check the result using the duo Fenchel–Young divergence: Y_{F2,F1*}(λ2, η1) = F2(λ2) + F1*(η1) − λ2·η1 with η1 = −1/λ1 and F1*(η) = −1 − log(−η).

Next, consider the calculation of the KLD between a half-normal distribution p_σ1 (support X1 = [0, ∞)) and a (full) zero-centered normal distribution q_σ2 (support X2 = ℝ). The common sufficient statistic is t(x) = x² with natural parameter θ = −1/(2σ²), and the log-normalizers are F1(θ) = (1/2) log(π/(−θ)) − log 2 and F2(θ) = F1(θ) + log 2, with ∇F1(θ) = −1/(2θ) = σ². Applying Theorem 1, we get

D_KL[p_σ1 : q_σ2] = log 2 + (1/2) (σ1²/σ2² − log(σ1²/σ2²) − 1).

Moreover, we can interpret this Bregman divergence as half of the Itakura–Saito divergence: D_KL[p_σ1 : q_σ2] = log 2 + (1/2) D_IS(σ1² : σ2²).

Thus the Kullback–Leibler divergence between a truncated density and another density of the same exponential family amounts to calculating a duo Bregman divergence with swapped parameter order: D_KL[p_θ1 : q_θ2] = B_{F2,F1}(θ2 : θ1). Equivalently, the reverse Kullback–Leibler divergence D*_KL[q_θ2 : p_θ1] := D_KL[p_θ1 : q_θ2] is a duo Bregman divergence in the given parameter order. Notice that truncated exponential families are also exponential families, but those exponential families may be non-steep [25].

Now let E_a and E_b be two truncated exponential families of the exponential family E with log-normalizer F, obtained by truncation to the nested supports X_a ⊆ X_b ⊆ X. Let m_θ(X_a) := ∫_{X_a} p_θ(x) dμ(x) denote the probability mass of the untruncated density p_θ on X_a; for a one-sided truncation X_a = [a, ∞) of a univariate family, m_θ(X_a) = 1 − Φ_θ(a), where Φ_θ denotes the CDF of p_θ. Then the log-normalizer of E_a is F_a(θ) = F(θ) + log m_θ(X_a) for θ ∈ Θ.

Corollary 1 (Kullback–Leibler divergence between densities of truncated exponential families). Let p^a_θ1 ∈ E_a and p^b_θ2 ∈ E_b with X_a ⊆ X_b. Then

D_KL[p^a_θ1 : p^b_θ2] = B_{F_b,F_a}(θ2 : θ1) = D_KL[p^a_θ1 : p^a_θ2] + log(m_θ2(X_b)/m_θ2(X_a)).

We have F_a(θ) ≤ F_b(θ) and B_{F_b,F_a}(θ2 : θ1) = B_{F_a}(θ2 : θ1) + F_b(θ2) − F_a(θ2). Therefore the KLD between the truncated exponential family densities p^a_θ1 and p^b_θ2 amounts to the KLD between the densities with the same truncation support plus an additive term depending on the log-ratio of the masses of the truncated supports evaluated at θ2.

We shall illustrate the calculation of the KLD between truncated exponential families with two examples.

Example 6. Consider the KLD between a truncated exponential distribution p^a_λ1(x) = λ1 e^{−λ1 (x − a)} on [a, ∞) and a truncated exponential distribution p^b_λ2(x) = λ2 e^{−λ2 (x − b)} on [b, ∞) with a ≥ b. The log-normalizer of the family truncated at a is F_a(λ) = −λa − log λ, with ∇F_a(λ) = −a − 1/λ. Thus we have:

D_KL[p^a_λ1 : p^b_λ2] = log(λ1/λ2) + (λ2/λ1) − 1 + λ2 (a − b).

When a = b, the KLD between two truncated exponential distributions with the same truncation support is log(λ1/λ2) + (λ2/λ1) − 1 = D_IS(λ2 : λ1), which coincides with the KLD between two untruncated exponential distributions of rates λ1 and λ2. We also check Corollary 1: since m_λ([a, ∞)) = e^{−λa} for the exponential family, the additive term is log(m_λ2(X_b)/m_λ2(X_a)) = λ2 (a − b).

Example 7 shows how to compute the Kullback–Leibler divergence between two truncated normal distributions. Let p^{[a,b]}_{μ,σ} denote the normal distribution of mean μ and standard deviation σ truncated to the support [a, b]:

p^{[a,b]}_{μ,σ}(x) = φ((x − μ)/σ) / (σ (Φ((b − μ)/σ) − Φ((a − μ)/σ))), x ∈ [a, b],

where φ and Φ denote the pdf and the CDF of the standard normal distribution, respectively. Truncated normal distributions form an exponential family of order 2 with sufficient statistic t(x) = (x, x²) and natural parameter θ(μ, σ) = (μ/σ², −1/(2σ²)); the natural parameter space is ℝ × (−∞, 0). The log-normalizer can be expressed using the source parameters as

F^{[a,b]}(θ(μ, σ)) = μ²/(2σ²) + log(√(2π) σ) + log(Φ((b − μ)/σ) − Φ((a − μ)/σ)).

We shall use the fact that the gradient of the log-normalizer of any exponential family amounts to the expectation of the sufficient statistics: η = ∇F(θ) = E[t(x)] = (E[x], E[x²]). Parameter η is called the moment or expectation parameter [1]. The mean of a truncated normal distribution is E[x] = μ + σ (φ(α) − φ(β))/(Φ(β) − Φ(α)), where α = (a − μ)/σ and β = (b − μ)/σ, and the second moment E[x²] is likewise available in closed form.

Now consider two truncated normal distributions p1 = p^{[a1,b1]}_{μ1,σ1} and p2 = p^{[a2,b2]}_{μ2,σ2} with nested supports [a1, b1] ⊆ [a2, b2]. Applying Theorem 1 (or integrating directly), we get

D_KL[p1 : p2] = log((σ2 Z2)/(σ1 Z1)) + E1[(x − μ2)²]/(2σ2²) − E1[(x − μ1)²]/(2σ1²),

where Z_i = Φ((b_i − μ_i)/σ_i) − Φ((a_i − μ_i)/σ_i) and the expectations E1[·] are taken with respect to p1 (and are thus expressed using the first two truncated moments). This formula is valid for (1) the KLD between two truncated normal distributions, or for (2) the KLD between a truncated normal distribution and a (full support) normal distribution (a2 = −∞, b2 = +∞, Z2 = 1). Note that the formula depends on the erf function used in the CDF Φ. Furthermore, when both supports are the full real line (Z1 = Z2 = 1 and E1[(x − μ1)²] = σ1²), we recover the usual formula for the KLD between normal distributions:

D_KL[N(μ1, σ1²) : N(μ2, σ2²)] = log(σ2/σ1) + (σ1² + (μ1 − μ2)²)/(2σ2²) − 1/2.

The differential entropy of a truncated normal distribution (an exponential family) is also available in closed form: when k(x) = 0, we have h(p_θ) = F(θ) − θ·η = −F*(η).
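The truncated-normal formula can be verified numerically. The following sketch (ours, assuming SciPy's truncnorm and quad; the helper kld_truncnorm is our naming, not the paper's) compares the moment-based closed form against direct numerical integration:

```python
# KLD between truncated normals with nested supports: closed form vs integration.
import numpy as np
from scipy import stats
from scipy.integrate import quad

def kld_truncnorm(mu1, s1, lo1, hi1, mu2, s2, lo2, hi2):
    assert lo2 <= lo1 and hi1 <= hi2, "support of p1 must be nested in support of p2"
    Z1 = stats.norm.cdf((hi1 - mu1) / s1) - stats.norm.cdf((lo1 - mu1) / s1)
    Z2 = stats.norm.cdf((hi2 - mu2) / s2) - stats.norm.cdf((lo2 - mu2) / s2)
    p1 = stats.truncnorm((lo1 - mu1) / s1, (hi1 - mu1) / s1, loc=mu1, scale=s1)
    m1, v1 = p1.mean(), p1.var()          # E_{p1}[x] and Var_{p1}[x]
    # Uses E_{p1}[(x - c)^2] = v1 + (m1 - c)^2:
    return (np.log(s2 * Z2 / (s1 * Z1))
            + (v1 + (m1 - mu2) ** 2) / (2 * s2 ** 2)
            - (v1 + (m1 - mu1) ** 2) / (2 * s1 ** 2))

mu1, s1, mu2, s2 = 0.5, 1.0, -0.2, 1.5
p1 = stats.truncnorm((0 - mu1) / s1, (2 - mu1) / s1, loc=mu1, scale=s1)
p2 = stats.truncnorm((-1 - mu2) / s2, (3 - mu2) / s2, loc=mu2, scale=s2)
direct, _ = quad(lambda x: p1.pdf(x) * np.log(p1.pdf(x) / p2.pdf(x)), 0, 2)
print(kld_truncnorm(mu1, s1, 0, 2, mu2, s2, -1, 3), direct)   # both ~ same value
```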

5. Bhattacharyya Skewed Divergence between Truncated Densities of an Exponential Family

The Bhattacharyya α-skewed divergence [29,30] between two densities p and q with respect to μ is defined for a skewing scalar parameter α ∈ (0, 1) as:

D_{B,α}[p : q] := −log ∫_X p(x)^α q(x)^{1−α} dμ(x),

where X denotes the support of the distributions. The Bhattacharyya distance is D_B[p, q] := D_{B,1/2}[p : q]. The Bhattacharyya distance is not a metric distance since it does not satisfy the triangle inequality. The Bhattacharyya distance is related to the Hellinger distance [31] as follows:

D_H[p, q] = √(1 − exp(−D_B[p, q])).

The Hellinger distance is a metric distance. Let ρ_α[p : q] := ∫_X p(x)^α q(x)^{1−α} dμ(x) denote the skewed affinity coefficient, so that D_{B,α}[p : q] = −log ρ_α[p : q]. Since ρ_α[p : q] ∈ (0, 1], we have D_{B,α}[p : q] ≥ 0.

Consider an exponential family E with log-normalizer F. Then it is well-known that the α-skewed Bhattacharyya divergence between two densities of an exponential family amounts to a skewed Jensen divergence [30] (originally called Jensen difference in [32]):

D_{B,α}[p_θ1 : p_θ2] = J_{F,α}(θ1 : θ2),

where the skewed Jensen divergence is defined by

J_{F,α}(θ1 : θ2) := α F(θ1) + (1 − α) F(θ2) − F(α θ1 + (1 − α) θ2).

The convexity of the log-normalizer ensures that J_{F,α}(θ1 : θ2) ≥ 0. The Jensen divergence can be extended to the full real line α ∈ ℝ by rescaling it by 1/(α(1 − α)); see [33].

Now, consider calculating D_{B,α}[p̃_θ1 : q_θ2], where p̃_θ1 belongs to a truncated exponential family Ẽ of E with support X̃ ⊆ X and q_θ2 ∈ E. We have p̃_θ(x) = q̃(x; θ)/Z̃(θ) and q_θ(x) = q̃(x; θ)/Z(θ), where Z̃ and Z are the partition functions of Ẽ and E, respectively, with F̃ = log Z̃ and F = log Z. Thus we have the skewed affinity coefficient

ρ_α[p̃_θ1 : q_θ2] = exp(F̃(α θ1 + (1 − α) θ2) − α F̃(θ1) − (1 − α) F(θ2)),

and the α-skewed Bhattacharyya divergence is

D_{B,α}[p̃_θ1 : q_θ2] = α F̃(θ1) + (1 − α) F(θ2) − F̃(α θ1 + (1 − α) θ2).

Therefore we obtain (Theorem 2) D_{B,α}[p̃_θ1 : q_θ2] = J_{F̃,F,α}(θ1 : θ2), where we call J_{F̃,F,α} the duo Jensen divergence. Since F ≥ F̃, we check that

J_{F̃,F,α}(θ1 : θ2) = J_{F̃,α}(θ1 : θ2) + (1 − α)(F(θ2) − F̃(θ2)) ≥ J_{F̃,α}(θ1 : θ2) ≥ 0.

Figure 7 illustrates graphically the duo Jensen divergence; a numerical check follows the figure.
Figure 7

The duo Jensen divergence J_{F̃,F,α}(θ1 : θ2) is greater than the skewed Jensen divergence J_{F̃,α}(θ1 : θ2) for α ∈ (0, 1) when F ≥ F̃.
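The identity of Theorem 2 can also be verified numerically. This sketch (ours, assuming SciPy; the generators are those of the half-normal/normal example of Section 4) compares the duo Jensen divergence with the numerically integrated α-skewed Bhattacharyya divergence:

```python
# Duo Jensen divergence vs alpha-skewed Bhattacharyya divergence
# for a half-normal (sigma1) versus a zero-centered normal (sigma2).
import numpy as np
from scipy.integrate import quad

def F_half(theta):    # log-normalizer of the half-normal family, support [0, inf)
    return np.log(0.5 * np.sqrt(np.pi / -theta))

def F_full(theta):    # log-normalizer of the zero-centered normal family, support R
    return F_half(theta) + np.log(2.0)

def duo_jensen(theta1, theta2, alpha):
    return (alpha * F_half(theta1) + (1 - alpha) * F_full(theta2)
            - F_half(alpha * theta1 + (1 - alpha) * theta2))

s1, s2, alpha = 1.0, 2.0, 0.3
t1, t2 = -1 / (2 * s1**2), -1 / (2 * s2**2)                # natural parameters
p = lambda x: np.sqrt(2 / np.pi) / s1 * np.exp(-x**2 / (2 * s1**2))    # half-normal pdf
q = lambda x: np.exp(-x**2 / (2 * s2**2)) / (s2 * np.sqrt(2 * np.pi))  # normal pdf
rho, _ = quad(lambda x: p(x)**alpha * q(x)**(1 - alpha), 0, np.inf)
print(duo_jensen(t1, t2, alpha), -np.log(rho))             # both ~ same value
```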

In [30], it is reported that scaled skewed Jensen divergences tend asymptotically to Bregman divergences:

lim_{α→1⁻} (1/(1 − α)) J_{F,α}(θ1 : θ2) = B_F(θ2 : θ1).

Indeed, using the first-order Taylor expansion of F̃(θ1 + (1 − α)(θ2 − θ1)) when α → 1, we check that we have

J_{F̃,F,α}(θ1 : θ2) = (1 − α)(F(θ2) − F̃(θ1) − (θ2 − θ1)·∇F̃(θ1)) + o(1 − α) = (1 − α) B_{F,F̃}(θ2 : θ1) + o(1 − α).

Thus we have lim_{α→1⁻} (1/(1 − α)) J_{F̃,F,α}(θ1 : θ2) = B_{F,F̃}(θ2 : θ1). Moreover, we have lim_{α→1⁻} (1/(1 − α)) D_{B,α}[p̃_θ1 : q_θ2] = D_KL[p̃_θ1 : q_θ2], in accordance with Theorem 1: scaled skewed Bhattacharyya divergences tend to the Kullback–Leibler divergence. Similarly, we can prove that lim_{α→0⁺} (1/α) J_{F,α}(θ1 : θ2) = B_F(θ1 : θ2), which can be reinterpreted as scaled skewed Bhattacharyya divergences tending to the reverse Kullback–Leibler divergence. Note that J_{F̃,F,0}(θ1 : θ2) = F(θ2) − F̃(θ2) > 0, so that (1/α) J_{F̃,F,α}(θ1 : θ2) diverges as α → 0⁺, in accordance with the reverse KLD D_KL[q_θ2 : p̃_θ1] = +∞.
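A quick numerical check of this asymptotic scaling (ours, reusing the half-normal/normal generators assumed above) shows (1/(1 − α)) J_{F̃,F,α}(θ1 : θ2) approaching the duo Bregman divergence B_{F,F̃}(θ2 : θ1) as α → 1:

```python
# Scaled duo Jensen divergence tending to the duo Bregman divergence as alpha -> 1.
import numpy as np

F1 = lambda t: np.log(0.5 * np.sqrt(np.pi / -t))   # Ftilde: half-normal log-normalizer
F2 = lambda t: F1(t) + np.log(2.0)                 # F: normal log-normalizer
dF1 = lambda t: -0.5 / t                           # gradient of Ftilde

t1, t2 = -0.5, -0.125                              # sigma1 = 1, sigma2 = 2
duo_bregman = F2(t2) - F1(t1) - (t2 - t1) * dF1(t1)
for alpha in (0.9, 0.99, 0.999):
    duo_jensen = (alpha * F1(t1) + (1 - alpha) * F2(t2)
                  - F1(alpha * t1 + (1 - alpha) * t2))
    print(alpha, duo_jensen / (1 - alpha), "->", duo_bregman)
```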

6. Concluding Remarks

We considered the Kullback–Leibler divergence between two parametric densities belonging to truncated exponential families with nested supports [7], and we showed that their KLD is equivalent to a duo Bregman divergence on swapped parameter order (Theorem 1). This result generalizes the study of Azoury and Warmuth [13]. The duo Bregman divergence can be rewritten as a duo Fenchel–Young divergence using mixed natural/moment parameterizations of the exponential family densities (Definition 1). This second result generalizes the approach taken in information geometry [15,35]. We showed how to calculate the Kullback–Leibler divergence between two truncated normal distributions as a duo Bregman divergence. More generally, we proved that the skewed Bhattacharyya distance between two parametric densities of truncated exponential families amounts to a duo Jensen divergence (Theorem 2). We showed that asymptotically scaled duo Jensen divergences tend to duo Bregman divergences, generalizing a result of [30,33]. This study of duo divergences induced by a pair of generators was motivated by the formula obtained for the Kullback–Leibler divergence between two densities of two different exponential families originally reported in [23] (Equation (29)). It would be interesting to find applications of the duo Fenchel–Young, Bregman, and Jensen divergences beyond the scope of calculating statistical distances between truncated exponential family densities. Note that in [36], the authors exhibit a relationship between densities with nested supports and quasi-convex Bregman divergences; however, the parametric densities considered there do not form exponential families since their supports depend on their parameters. Recently, Khan and Swaroop [37] used the duo Fenchel–Young divergence in machine learning for knowledge-adaptation priors in the so-called change-regularizer task.
Table 1

Canonical decomposition of the Poisson and the geometric discrete exponential families.

Quantity | Poisson family P | Geometric family Q
support | ℕ ∪ {0} | ℕ ∪ {0}
base measure | counting measure | counting measure
ordinary parameter | rate λ > 0 | success probability p ∈ (0, 1)
pmf | (λ^x / x!) exp(−λ) | (1 − p)^x p
sufficient statistic | t_P(x) = x | t_Q(x) = x
natural parameter | θ(λ) = log λ | θ(p) = log(1 − p)
cumulant function | F_P(θ) = exp(θ) | F_Q(θ) = −log(1 − exp(θ))
(in ordinary parameter) | F_P(θ(λ)) = λ | F_Q(θ(p)) = −log p
auxiliary term | k_P(x) = −log x! | k_Q(x) = 0
moment η = E[t(x)] | η = λ | η = e^θ/(1 − e^θ) = (1/p) − 1
negentropy F*(η) = θ·η − F(θ) | F_P*(η(λ)) = λ log λ − λ | F_Q*(η(p)) = ((1/p) − 1) log(1 − p) + log p

