Leo L Duan1, David B Dunson2. 1. Department of Statistics, University of Florida, Gainesville, FL 32611, USA. 2. Department of Statistical Science, Duke University, Durham, NC 27708, USA.
Abstract
Model-based clustering is widely used in a variety of application areas. However, fundamental concerns remain about robustness. In particular, results can be sensitive to the choice of kernel representing the within-cluster data density. Leveraging properties of pairwise differences between data points, we propose a class of Bayesian distance clustering methods, which rely on modeling the likelihood of the pairwise distances in place of the original data. Although some information in the data is discarded, we gain substantial robustness to modeling assumptions. The proposed approach represents an appealing middle ground between distance- and model-based clustering, drawing advantages from each of these canonical approaches. We illustrate dramatic gains in the ability to infer clusters that are not well represented by the usual choices of kernel. A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data.
Clustering is a primary focus of many statistical analyses, providing a valuable tool for exploratory data analysis and simplification of complex data. In the literature, there are two primary approaches: distance- and model-based clustering. Let y_i, for i = 1, …, n, denote the data and let d(y, y′) denote a distance between data points y and y′. Then, distance-based clustering algorithms are typically applied to the n × n matrix of pairwise distances D = {d_ij}, with d_ij = d(y_i, y_j) for all pairs i, j ∈ [n] = {1, …, n}. For reviews, see Jain (2010); Xu and Tian (2015). In contrast, model-based clustering takes a likelihood-based approach in building a model for the original data y^(n) that has the form:
where π = (π_1, …, π_k)′ is a vector of probability weights in a finite mixture model, h is a cluster index, and K(y; θ_h) is the density of the data within cluster h. Typically, K is a density in a parametric family, such as the Gaussian, with θ_h denoting the parameters. The finite mixture model (1) can be obtained by marginalizing out the cluster index c_i ∈ {1, …, k} in the following model:
Using this data-augmented form, one can obtain maximum likelihood estimates of the model parameters π and θ = {θ_h} via an expectation-maximization algorithm (Fraley and Raftery, 2002). Alternatively, Bayesian methods are widely used to include prior information and characterize uncertainty in the parameters. For reviews, see Bouveyron and Brunet-Saumard (2014) and McNicholas (2016).

Distance-based algorithms tend to have the advantage of being relatively simple conceptually and computationally, while a key concern is the lack of characterization of uncertainty in clustering estimates and associated inferences. While model-based methods can address these concerns by exploiting a likelihood-based framework, a key disadvantage is a large sensitivity to the choice of kernel. Often, kernels are chosen for simplicity and computational convenience, and they place rigid assumptions on the shape of the clusters that are not justified by the applied setting being considered.

We are not the first to recognize this problem, and there is a literature attempting to address issues with kernel robustness in model-based clustering. One direction is to choose a flexible class of kernels, which can characterize a wide variety of densities. For example, one can replace the Gaussian kernel with one that accommodates asymmetry, skewness and/or heavier tails (Karlis and Santourian (2009); Juárez and Steel (2010); O'Hagan et al. (2016); Gallaugher and McNicholas (2018); among others). A related direction is to nonparametrically estimate the kernels specific to each cluster, while placing minimal constraints for identifiability, such as unimodality and sufficiently light tails. This direction is related to the mode-based clustering algorithms of Li et al. (2007); see also Rodríguez and Walker (2014) for a Bayesian approach using unimodal kernels. Unfortunately, as discussed by Hennig et al.
(2015), a kernel that is too flexible leads to ambiguity in defining a cluster and to identifiability issues: for example, one cluster can be the union of several clusters that are close together. Practically, such flexible kernels demand a large number of parameters, leading to a daunting computational cost.

A promising new strategy is to replace the likelihood with a robust alternative. Coretto and Hennig (2016) propose a pseudo-likelihood based approach for robust multivariate clustering, which captures outliers with an extra improper uniform component. Miller and Dunson (2018) propose a coarsened Bayes approach for robustifying Bayesian inference and apply it to clustering problems. Instead of assuming that the observed data are exactly generated from (1) in defining a Bayesian approach, they condition on the event that the empirical probability mass function of the observed data is within some small neighborhood of that for the assumed model. Both of these methods aim to allow small deviations from a simple kernel. It is difficult to extend these approaches to data with high complexity, such as clustering multiple time series, images, etc.

We propose a new approach based on a Bayesian model for the pairwise distances, avoiding a complete specification of the likelihood function for the data y^(n). There is a rich literature proposing Bayesian approaches that replace an exact likelihood function with some alternative. Chernozhukov and Hong (2003) consider a broad class of such quasi-posterior distributions. Jeffreys (1961) proposed a substitution likelihood for quantiles for use in Bayesian inference; also refer to Dunson and Taylor (2005). Hoff (2007) proposed a Bayesian approach to inference in copula models, which avoids specifying models for the marginal distributions via an extended rank likelihood. Johnson (2005) proposed Bayesian tests based on modeling frequentist test statistics instead of the data directly.
These are just some of many examples.

Our proposed Bayesian distance clustering approach gains some of the advantages of model-based clustering, such as uncertainty quantification and flexibility, while significantly simplifying the model specification task. There is a connection between our approach and nonnegative matrix factorization (NMF) methods (Kuang et al., 2012; Zhao et al., 2015; Kuang et al., 2015). Certain NMF algorithms can be viewed as fast approximations to our likelihood-based approach. Our major contributions are: (i) establishing a novel link between model- and distance-based frameworks, (ii) introducing a principled choice for assigning kernels for distances (equivalent to the affinity/similarity score in NMF), and (iii) providing a way to calibrate the parameters within the proposed probabilistic framework.
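As a concrete starting point, the pairwise-distance input used by distance-based methods can be computed as follows (a minimal numpy sketch; the function name is ours):

```python
import numpy as np

def pairwise_distances(y, ord=2):
    """Return the n x n matrix D with D[i, j] = ||y_i - y_j||_ord."""
    y = np.asarray(y, dtype=float)
    diff = y[:, None, :] - y[None, :, :]          # (n, n, p) array of pairwise differences
    return np.linalg.norm(diff, ord=ord, axis=2)  # reduce each difference vector to a distance

# Toy data: three points in R^2
y = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
D = pairwise_distances(y)   # symmetric, with a zero diagonal
```

Any vector norm (e.g., the 1-norm via ord=1) can be substituted through the ord argument.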
Partial likelihood for distances
Motivation for partial likelihood
Suppose that data y^(n) are generated from model (1) or, equivalently, (2). We focus on the case in which the data are continuous vectors, y_i ∈ ℝ^p. The conditional likelihood of the data y^(n) given clustering indices c^(n) can be expressed as
where we let K_h denote the density of data within cluster h, and y_[h] denotes the data in cluster h. Since the information in c^(n) is stored by the index [h], we will omit c^(n) in the notation when [h] appears. Referring to y_[h]1 as the seed for cluster h, we can express the likelihood L(y_[h]) using the change of variables u_[h]i = y_[h]i − y_[h]1:
where u_[h]i denotes the difference vector between y_[h]i and the seed (with G the transformed kernel), and the second line is an equivalent factorization of the joint distribution, with the conditional density of the seed given the differences. Expression (4) is a product of the density of the seed and of the (n_h − 1) differences. As the cluster size n_h increases, the relative contribution of the seed density decreases and the likelihood becomes dominated by G. With this heuristic justification, we discard the seed term by treating the value of the seed as random and integrating it out.

We now use a toy example to illustrate how to derive the function G from a known model-based likelihood (later we will show how to specify G directly, when the model-based likelihood is not known). Consider the case of data from a Gaussian mixture, starting from the likelihood of those y_i's associated with c_i = h:
To obtain G, we use the change of variables u_[h]i = y_[h]i − y_[h]1, and integrate out the seed y_[h]1 (as its information is now contained in the differences):
where step (a) follows from properties of the Gaussian density.

In the above example, note that the right-hand side appears to be a product density of all the pairwise differences y_[h]i − y_[h]j, besides those formed with the seed. This is due to the linear equality y_[h]i − y_[h]j = u_[h]i − u_[h]j; therefore, there are effectively only (n_h − 1) free random variables, and once they are given, the rest are completely determined.

Generalizing from this form, we now specify G as
where each difference is assigned a marginal density g. The power 1/n is a calibration parameter that adjusts for the order discrepancy between the (n − 1)n/2 marginal densities and the (n − 1) effective random variables. We will formally justify this calibration in the theory section.

Remark 1
To clarify, despite the simple product form, (5) should not be treated as a product of independent densities of the n(n − 1)/2 differences. This is because these differences contain effectively (n − 1) random variables u_[h]i, and (n − 1)(n − 2)/2 interaction terms u_[h]i − u_[h]j; these interaction terms induce dependence among the differences. For example, for two differences sharing a common index, the related terms in (5) form a non-separable function of the underlying random variables, and hence the differences are not independent.

We now state the assumptions that we use for clustering.

Assumption 1
For those data within a cluster, y_[h]i and y_[h]j are independent and identically distributed.

Focusing on marginally specifying each g, we can immediately obtain two key properties of the differences: (1) expectation zero, and (2) marginal symmetry with skewness zero. Hence, the distribution of the differences is substantially simpler than the original data distribution. This suggests that using G for clustering will substantially reduce the model complexity and improve robustness.

We connect the density of the differences to a likelihood of 'distances', here used as a loose notion including metrics, semi-metrics and divergences. Consider d ∈ [0, ∞) as a transform of the difference vector, such as a norm (e.g., the Euclidean or 1-norm); hence, a likelihood for d is implicitly associated with a pushforward measure from the one on the differences (assuming a measurable transform). For example, an exponential density on d can be taken as the result of assigning a multivariate Laplace density to the difference vector. We can further generalize the notion of difference from subtraction to other types, such as a ratio, cross-entropy, or an application-driven specification (Izakian et al., 2015).

To summarize, this motivates the practice of first calculating a matrix of pairwise distances, and then assigning a partial likelihood for clustering. For generality, we slightly abuse notation and replace the difference array with the distance matrix D in (5), and denote the density by G(D_[h]). We will refer to (5) as the distance likelihood from now on. Conditional on the clustering labels,
with D_[h] distributed independently across h, as in (2).

To provide some intuition about the clustering uncertainty, we simulate two clusters using the bivariate skewed Gaussian distribution: in each dimension, the first cluster has a skewness of 4 (red in Figure 1, left panel), location 0 and scale 1, and the second has a skewness of −2, location 0.5 and scale 1 (grey in Figure 1, left panel); the two coordinates are generated independently. Via the skew Gaussian density used to generate the data, we can compute the oracle assignment probability pr(c_i = h | y_i) for h = 1 and 2, and the most likely cluster assignment for each data point.
Figure 1:
Illustration of the clustering uncertainty and its estimation using the distance-based clustering method. Left panel: two overlapping clusters (red and grey) generated from two skew Gaussian distributions with n = 400. Center panel: the oracle uncertainty calculated based on the generative distribution. Right panel: the matrix of the co-assignment probabilities pr(c_i = c_j | D) estimated using the distance likelihood of D.
Clearly, due to the overlap of the two clusters, there is a significant amount of uncertainty for those points near the location (0, 0), as the oracle assignment probability remains away from 0 and 1 (Figure 1, center panel); importantly, such uncertainty does not vanish even as n → ∞, as these points remain nearly equidistant to the two cluster centers. Using the distance likelihood on D, we can obtain an easy quantification of the uncertainty by sampling c_i from the posterior distribution and calculating the co-assignment probabilities pr(c_i = c_j | D); as shown in the right panel, the off-diagonal block shows that a significant portion of the data can be co-assigned to either the first or the second cluster with non-trivial probability.
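The co-assignment probabilities above are simple Monte Carlo averages over posterior samples of the labels; a minimal sketch (the posterior draws below are hypothetical placeholders):

```python
import numpy as np

def coassignment_matrix(label_samples):
    """Estimate pr(c_i = c_j | D) by averaging, over posterior samples of the
    label vector c, the indicator that points i and j share a cluster."""
    label_samples = np.asarray(label_samples)                      # (n_samples, n)
    same = label_samples[:, :, None] == label_samples[:, None, :]  # pairwise agreement per draw
    return same.mean(axis=0)

# Four hypothetical posterior draws for n = 4 points; the fourth point is ambiguous.
draws = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 0],
                  [0, 0, 1, 1],
                  [0, 0, 1, 0]])
P = coassignment_matrix(draws)   # P[0, 1] = 1.0; P[0, 3] = P[2, 3] = 0.5
```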
Choosing a distance density for clustering
To implement our Bayesian distance clustering approach, we need a definition of clusters, guiding us to choose a parametric form for g(·) in (5). For conciseness, we will focus on norm-based distances from now on. A popular intuition for a cluster is a group of data points such that most of the distances among them are relatively small. That is, the probability of finding large distances within a cluster should be low. We now state the assumption.

Assumption 2

With σ > 0 a scale parameter, the probability of a within-cluster distance exceeding tσ is bounded by ϵ(t), a function that rapidly declines towards 0 as t increases.
For such a decline, it is common to consider the sub-exponential rate (Wainwright, 2019), with some constant b > 0. Ideally, we want σ to be small, so that most of the distances within a cluster are small.

It is tempting to consider a simple exponential density Exp(σ) for modeling the distances; however, we make an important observation here: the exponential distribution has a deterministic relationship between the mean σ and the variance σ². This means that any slightly large distances (such as when the distribution of the distances does not follow an exponential decay near zero) will inflate the estimate of σ, making it difficult to use small distances for clustering.

This motivates us to use a two-parameter distribution instead; in this article, we use Gamma(α, σ) for g in (5), whose variance ασ² is no longer completely determined by the mean ασ.
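The point can be checked directly with scipy: the exponential ties its variance to the square of its mean, while the Gamma(α, σ) keeps them decoupled (the parameter values here are arbitrary):

```python
from scipy.stats import expon, gamma

sigma = 2.0
d_exp = expon(scale=sigma)         # Exp(sigma): mean sigma, variance sigma^2, locked together
d_gam = gamma(a=1.5, scale=sigma)  # Gamma(alpha, sigma): mean alpha*sigma, variance alpha*sigma^2

# For the exponential, the variance is fully determined by the mean ...
exp_var_locked = d_exp.var() == d_exp.mean() ** 2
# ... while for the gamma, the shape alpha frees the variance from the mean.
gam_mean, gam_var = d_gam.mean(), d_gam.var()   # 3.0 and 6.0
```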
We defer the prior choice for α and σ to a later section. For now, we verify that the Gamma distribution does have a sub-exponential tail satisfying Assumption 2.

Lemma 2
(Bound on the right tail) If d has the density (8), then for any α ≥ 1 and t > 0, the right tail pr(d > t) satisfies a sub-exponential bound.

Remark 3
The polynomial term in the tail bound is dominated by the exponential factor for large t.

The assumption on the distances is connected to implicit assumptions on the data distribution. As such a link varies with the specific form of the distance, we again focus on the vector norm of the subtraction, with q ≥ 1. We now show that a sub-exponential tail for the vector norm distance is a necessary consequence of assuming sub-exponential tails in the data distribution.

Lemma 4
(Tail of vector norm distance) If there exist bounding constants such that each coordinate difference has a sub-exponential tail, for all j = 1, …, p, then there exist another two constants ν, b > 0 such that the tail bound holds for any q ≥ 1.

Remark 5
The concentration property (9) is less restrictive than common assumptions on the kernel in a mixture model, such as Gaussianity, log-concavity or unimodality.
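Lemma 2's sub-exponential tail can also be checked numerically; the envelope constant b = 2σ below is illustrative, not the lemma's constant:

```python
import numpy as np
from scipy.stats import gamma

alpha, sigma = 1.5, 1.0
t = np.array([5.0, 10.0, 20.0])
tail = gamma(a=alpha, scale=sigma).sf(t)   # pr(d > t) under the Gamma distance density
envelope = np.exp(-t / (2.0 * sigma))      # sub-exponential envelope exp(-t/b) with b = 2*sigma
dominated = np.all(tail <= envelope)       # the gamma tail sits below the envelope at these t
```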
Hyper-prior specification for α and σ
In Bayesian clustering, it is useful to choose the prior parameters in a reasonable range (Malsiner-Walli et al., 2017). Recall that in our gamma density, the mean of d is ασ. To favor small values for the mode while accommodating a moderate degree of uncertainty, we use a Gamma prior α ~ Gamma(1.5, 1.0).

To select a prior for σ, we associate it with a pre-specified maximum cluster number k. We can view k as a packing number: how many balls (clusters) we can fit in a container of the data. To formalize, imagine a p-dimensional ellipsoid in ℝ^p enclosing all the observed data. The smallest volume of such an ellipsoid is
which can be obtained via a fast convex optimization algorithm (Sun and Freund, 2004). If we view each cluster as a high-probability ball of points originating from a common distribution, then the diameter (the distance between the two points that are farthest apart) is approximately 4σ. This is calculated based on pr(d ≤ 4σ) ≈ 0.95 using the gamma density with shape 1.5 (the prior mean of α). Setting k to the packing number
yields a sensible prior mean for σ. For conjugacy, we choose an inverse-gamma prior for σ with its mean set accordingly. The above prior can be used as a default in broad applications, and does not require tuning for each new application.
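Under the heuristic above (each cluster a ball of diameter 4σ, hence radius 2σ, packed into the data ellipsoid), σ can be solved in closed form; a sketch, with a placeholder ellipsoid volume (the paper's exact elicitation may differ):

```python
import math

def sigma_prior_mean(ellipsoid_vol, p, k):
    """Solve k = ellipsoid_vol / vol(ball of radius 2*sigma) for sigma,
    using vol(p-ball of radius r) = pi^(p/2) / Gamma(p/2 + 1) * r^p."""
    unit_ball = math.pi ** (p / 2.0) / math.gamma(p / 2.0 + 1.0)
    r = (ellipsoid_vol / (k * unit_ball)) ** (1.0 / p)   # ball radius, equal to 2*sigma
    return r / 2.0

sigma0 = sigma_prior_mean(ellipsoid_vol=100.0, p=2, k=5)   # prior mean for sigma
```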
Theory
We describe several interesting properties for the distance likelihood.
Calibration
Lemma 6 (Exchangeability) When the product density (5) is used for all G(D_[h]), h = 1, …, k, the distance likelihood (6) is invariant to permutations of the indices i:
with {1*, …, n*} denoting a set of permuted indices.

We fill a gap between the model-based and distance likelihoods by considering an information-theoretic analysis of the two clustering approaches. This also leads to a principled choice of the power 1/n in (5).

To quantify the information in clustering, we first briefly review the concept of Bregman divergence (Bregman, 1967). Letting ϕ be a strictly convex and differentiable function on its domain, the Bregman divergence is defined as
where ∇ϕ(y) denotes the gradient of ϕ at y. A large family of loss functions, such as the squared norm and the Kullback-Leibler divergence, are special cases of the Bregman divergence with suitable ϕ. For model-based clustering, when a regular exponential family ('regular' in that the parameter space is a non-empty open set) is used for the component kernel, Banerjee et al. (2005) show that there always exists a re-parameterization of the kernel using a Bregman divergence. Using our notation,
where T(y) is a transformation of y, in the same form as the minimal sufficient statistic for θ (except that this 'statistic' is based on only one data point y); μ_h is the expectation of T(y) taken with respect to the kernel of cluster h; and ψ, κ and b are functions mapping to (0, ∞).

With this re-parameterization, maximizing the model-based likelihood over c^(n) becomes equivalent to minimizing the within-cluster Bregman divergence
We will refer to H_y as the model-based divergence.

For the distance likelihood, considering those distances that can be viewed or re-parameterized as a pairwise Bregman divergence, we assume each term in the distance likelihood (5) can be re-written with a calibrating power β > 0 as
with z > 0 the normalizing constant. A distance-based divergence H_D can be computed accordingly. We now compare these two divergences at their expectations.

Lemma 7
(Expected Bregman Divergence) Consider the distance-based Bregman divergence (11) in cluster h; its expectation over y_[h] is taken with respect to the kernel of cluster h.

Remark 8
The term inside the expectation on the right-hand side is the symmetrized Bregman divergence (Banerjee et al., 2005). Therefore,
when B(·, ·) is symmetric.

There is an order difference of n between the distance-based and model-based divergences. Therefore, a sensible choice is simply setting β = 1/n. This power is related to the weights used in composite pairwise likelihood (Lindsay, 1988; Cox and Reid, 2004).
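The Bregman divergence underlying this calibration argument is easy to compute directly; with ϕ the squared Euclidean norm it reduces to the squared distance, which is symmetric (a minimal sketch):

```python
import numpy as np

def bregman(phi, grad_phi, y, mu):
    """B_phi(y, mu) = phi(y) - phi(mu) - <grad phi(mu), y - mu>."""
    return phi(y) - phi(mu) - grad_phi(mu) @ (y - mu)

phi = lambda x: x @ x        # squared Euclidean norm
grad = lambda x: 2.0 * x
y, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
b = bregman(phi, grad, y, mu)   # equals ||y - mu||^2 = 5
```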
Relationship to Graph Cut
It is also interesting to consider the matrix form of the distance likelihood. We use C as an n × k binary matrix encoding the cluster assignment, with C_ih = 1 if c_i = h, and all other entries of row i equal to 0. Then it can be verified that CᵀC = diag(n_1, …, n_k). Hence the distance likelihood, with the Gamma density, is
where D is the n × n distance matrix, log is applied element-wise, Σ = diag(σ_1, …, σ_k), and Λ = diag(α_1 − 1, …, α_k − 1). If C contains zero columns, the inverse is replaced by a generalized inverse.

One may notice some resemblance of (12) to the loss function in graph partitioning algorithms. Indeed, if we simplify the parameters to α_1 = ··· = α_k = α_0 and σ_1 = ··· = σ_k = σ_0, then
where A = κ1 − D/σ_0 + (α_0 − 1) log D can be considered as an adjacency matrix of a graph formed by a log-Gamma distance kernel, with 1 an n × n matrix with all elements equal to 1, and κ a constant chosen so that each A_ij > 0 (since κ enters the likelihood as a constant, tr{Cᵀκ1C(CᵀC)⁻¹} = nκ, it does not impact the likelihood of C). To compare, the popular normalized graph-cut loss (Bandeira et al., 2013) is
which is the total weight of edges deleted by the partitioning (weighted by the inverse cluster sizes to prevent trivial cuts). There is an interesting link between (13) and (14).

Lemma 9
Considering a graph with weighted adjacency matrix A, the normalized graph-cut loss is related to the negative log-likelihood (omitting constants) (13) via

Remark 10
The difference on the right is often known as the degree-based regularization (with n_{c_i} the size of the cluster that data point i is assigned to). When the cluster sizes are relatively balanced, we can ignore its effect.

Such a near-equivalence suggests that we can exploit popular graph clustering algorithms, such as spectral clustering, for good initialization of C as a warm start for the Markov chain Monte Carlo algorithm.
Posterior computation
For the posterior computation, a Gibbs sampler is easy to derive, as it involves updating one element of the parameter at a time from the full conditional distribution. However, this could lead to a heavy computational cost for our model. To understand this, consider the update step for each c_i, which involves a draw from the categorical distribution:
where the candidate denotes a matrix equal to the current value of C, except that the ith row is replaced with C_ih = 1 and C_ij = 0 for j ≠ h. Since G(D; C) involves a matrix inverse term (CᵀC)⁻¹, the above ratio cannot be simplified to reduce the computational burden. The evaluation cost for each G(D; C) is dominated by the matrix multiplication steps within. Therefore, iterating over h = 1, …, k and i = 1, …, n leads to a high cost in one sweep of updates.

To solve this problem, we instead develop a more efficient algorithm based on an approximate Hamiltonian Monte Carlo (HMC) algorithm. We use a continuous relaxation of each row C_i (on a simplex vertex) into the interior of the simplex, and denote the relaxation by W_i. This is achieved via a tempered softmax re-parameterization (Maddison et al., 2017)
At small t > 0 close to 0, if one v_ih is slightly larger than the rest of {v_i1, …, v_ik}, then w_ih will be close to 1, and all the other w_ih′'s close to 0. In this article, we use t = 0.1 as a balance between the approximation accuracy and the numerical stability of the algorithm. In addition, we re-parameterize the other parameters using the softplus function, for h = 1, …, k, and the softmax function (as defined above except with t = 1), so that all parameters are unconstrained in ℝ and amenable to the off-the-shelf continuous HMC algorithm.

We denote the vectorized parameters by β. To sample from the posterior distribution β ~ Π(·), HMC uses an auxiliary momentum variable v and samples from the joint distribution Π(β, v) = Π(β | D)Π(v), where a common choice of Π(v) is the density of N(0, M). Denote U(β) = −log Π(β | D) and K(v) = −log Π(v) = vᵀM⁻¹v/2, which are commonly referred to as the potential energy and kinetic energy, respectively. The total Hamiltonian energy function is H(β, v) = U(β) + K(v).

At each state (β, v), a new proposal (β*, v*) is generated by simulating Hamiltonian dynamics satisfying Hamilton's equations:
Since the exact solution to the above is intractable, we numerically approximate the temporal evolution using the leapfrog scheme.

To accelerate the convergence of the Markov chain to stationarity, we first use the BFGS optimization algorithm (implemented in the PyTorch package) to minimize U(β) and obtain the posterior mode. We then initialize the Markov chain at the mode.

A typical choice for the working parameter M is to let it roughly scale with the covariance matrix of the posterior distribution (Neal, 2011). Using the posterior mode, we calculate the observed Fisher information there (the Hessian matrix of U(β) evaluated at the mode), which gives an approximation to the inverse covariance of β. Although it is tempting to set M directly using this Hessian, the required matrix inversion is often costly and ill-conditioned. To avoid this problem, we use a simpler diagonal parameterization, which shows good empirical performance in all the examples within this article.

To run the HMC sampler, we use the No-U-Turn Sampler (NUTS-HMC) algorithm (Hoffman and Gelman, 2014) implemented in the 'hamiltorch' package (Cobb and Jalaian, 2020), which also automatically tunes the other two working parameters, the step size ϵ and the number of leapfrog steps L. After the automatic tuning, the algorithm reaches an acceptance rate close to 70%, as commonly desired for good mixing of Markov chains. To give a sense of the running time: using a quad-core i7 CPU, at n = 1000, the HMC algorithm takes about 20 minutes to run 10,000 iterations.

Remark 11
On the computational cost, the most expensive step in the HMC algorithm is the calculation of the derivative of log G(D; W) with respect to the matrix W, which involves the following form:
where the matrix inversion involves only a k × k matrix and is therefore not costly. Running over L leapfrog steps, the computational cost per iteration of HMC is dominated by L times the cost of these matrix multiplications.

Potentially, one could instead consider a Gibbs sampling algorithm, using a block-wise update of Cᵀ(log D)C and CᵀDC (instead of a full evaluation of the matrix product) when sampling each row of C. Despite the similar computing complexity, a strength of HMC is that we can take advantage of highly parallelized matrix operations on C, which is often faster than sequentially looping over each row of C.

In comparison, parametric/model-based clustering algorithms have a lower cost of O(n), although this often comes with a risk of model misspecification for modern data. Therefore, choosing between these classes of methods involves a trade-off between computational speed and model robustness.

The posterior samples of CCᵀ give an estimate of the pairwise co-assignment probabilities pr(c_i = c_j | D). To obtain estimates for pr(c_i = h), we use symmetric simplex matrix factorization (Duan, 2020) on {pr(c_i = c_j)} to obtain an n × k matrix corresponding to {pr(c_i = h)}. For diagnostics on convergence, we calculate the autocorrelation function (ACF) and the effective sample size (ESS) for each parameter, and we provide some diagnostic plots in the appendix.

In this article, for ease of visualization and interpretation, we use the co-assignment probability matrix as a measure of the uncertainty in the point estimate. An alternative is to use the variation of information (Wade and Ghahramani, 2018) as a metric between clusterings, leading to a discrete extension of the posterior mean and credible intervals; the reader can find the method and toolbox within the reference.

In addition, in the appendix, we show that the non-negative matrix factorization (NMF) algorithm, if using a calibrated similarity matrix as its input (such as one based on our distance likelihood), produces an almost indistinguishable result from the Bayesian distance clustering method.
On the other hand, if the similarity is set less carefully (such as using the "default" choice in popular machine learning packages), we found severe sensitivity leading to over- or under-estimation of the uncertainty (as shown in Panel 4 of Figure 8). Therefore, for the sake of both conciseness and fairness, we choose not to present the uncalibrated NMF results in the numerical experiments.
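For completeness, the leapfrog scheme used in the posterior computation section to simulate the Hamiltonian dynamics can be sketched as follows (identity mass matrix; a generic sketch, not the hamiltorch implementation):

```python
import math

def leapfrog(grad_U, beta, v, eps, L):
    """Leapfrog integration of Hamilton's equations with M = I:
    half momentum step, alternating full position/momentum steps, half momentum step."""
    v = v - 0.5 * eps * grad_U(beta)
    for step in range(L):
        beta = beta + eps * v
        if step < L - 1:
            v = v - eps * grad_U(beta)
    v = v - 0.5 * eps * grad_U(beta)
    return beta, v

# Harmonic potential U(beta) = beta^2 / 2: the dynamics rotate (beta, v)
# and (approximately) conserve the total energy H = U + K.
beta, v = leapfrog(lambda b: b, beta=1.0, v=0.0, eps=0.01, L=100)
energy = 0.5 * beta ** 2 + 0.5 * v ** 2   # stays near the initial value 0.5
```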
Figure 8:
Comparing the performance of the Bayesian distance clustering (BDC) and the nonnegative matrix factorization (NMF) using the symmetric simplex matrix factorization (Zhao et al., 2015; Duan, 2020). We apply the algorithms on the two clusters of skew Gaussian data as generated in Section 2.1, and plot the estimated co-assignment probability matrix for each method. As can be seen from Panels 2 and 3, when NMF is well calibrated in its similarity function (as we defined in (13), with parameters set to the posterior mean estimate from BDC), it produces a result almost indistinguishable from BDC — both are very close to the oracle co-assignment probability (Panel 1). On the other hand, NMF using an uncalibrated similarity produces a result (Panel 4) very different from the oracle.
Numerical experiments
Clustering with skewness-robust distance
As described in Section 2.1, the vector norm-based distance is automatically robust to skewness. To illustrate, we generate n = 200 data from a two-component mixture of skewed Gaussians:
where SN(μ, σ, α) has density π(y | μ, σ, α) = (2/σ) f{(y − μ)/σ} F{α(y − μ)/σ}, with f and F the density and cumulative distribution functions of the standard Gaussian distribution.

We start with p = 1 and assess the performance of the Bayesian distance clustering model under both non-skewed (α_1 = α_2 = 0, μ_1 = 0, μ_2 = 3) and skewed distributions (α_1 = 8, α_2 = 10, μ_1 = 0, μ_2 = 2). The results are compared against the mixture of Gaussians as implemented in the Mclust package. Figure 3(a,c) shows that for non-skewed Gaussians, the proposed approach produces clustering probabilities close to their oracle probabilities, obtained using knowledge of the true kernels that generated the data. When the true kernels are skewed Gaussians, Figure 3(b,d) shows that the mixture of Gaussians gives inaccurate estimates of the clustering probability, whereas Bayesian distance clustering remains similar to the oracle.
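Data from this two-component skew-Gaussian mixture can be generated with scipy.stats.skewnorm, whose density is the skew-normal form above; the function name, default parameters, and seed here are ours:

```python
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)

def sample_skew_mixture(n, pi=0.5, loc=(0.0, 2.0), scale=(1.0, 1.0), alpha=(8.0, 10.0)):
    """Draw n points from pi * SN(loc_1, scale_1, alpha_1) + (1-pi) * SN(loc_2, scale_2, alpha_2)."""
    c = rng.binomial(1, 1.0 - pi, size=n)   # latent component labels
    y = np.array([skewnorm.rvs(alpha[h], loc=loc[h], scale=scale[h], random_state=rng)
                  for h in c])
    return y, c

y, c = sample_skew_mixture(200)   # n = 200 draws together with their true labels
```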
Figure 3:
Clustering data from a two-component mixture of skewed Gaussians in . Bayesian Distance clustering (BDC) gives posterior clustering probabilities close to the oracle probabilities regardless of whether the distribution is skewed or not (upper plots in panel c and d), while the mixture of Gaussians fails when the skewness is present (lower plot in panel d).
To evaluate the accuracy of the point estimate, we compute the adjusted Rand index (Rand, 1971) with respect to the true labels. We test different p ∈ {1, 5, 10, 30}, and repeat each experiment 30 times. The results are compared against model-based clustering using symmetric and skewed Gaussian kernels with independent variance structures. As shown in Table 1, the misspecified symmetric model deteriorates quickly as p increases. In contrast, Bayesian distance clustering maintains high clustering accuracy.
Table 1:
Accuracy of clustering skewed Gaussians under different dimensions p. The adjusted Rand index (ARI) is computed for the point estimates obtained using variation of information. The averages and 95% confidence intervals are shown.
p  | Bayes Dist. Clustering | Mix. of Gaussians | Mix. of Skewed Gaussians
1  | 0.80 (0.75, 0.84)      | 0.65 (0.55, 0.71) | 0.81 (0.75, 0.85)
5  | 0.76 (0.71, 0.81)      | 0.55 (0.40, 0.61) | 0.76 (0.72, 0.80)
10 | 0.72 (0.68, 0.76)      | 0.33 (0.25, 0.46) | 0.62 (0.53, 0.71)
30 | 0.71 (0.67, 0.76)      | 0.25 (0.20, 0.30) | 0.43 (0.37, 0.50)
Clustering high dimensional data with subspace distance
For high-dimensional clustering, it is often useful to impose the additional assumption that each cluster lives near a different low-dimensional manifold. Clustering data based on these manifolds is known as subspace clustering. We exploit the sparse subspace embedding algorithm proposed by Vidal (2011) to learn pairwise subspace distances. Briefly, since data points in the same cluster are alike, each data point can be approximated as a linear combination of a few other data points in the same subspace; hence a sparse locally linear embedding can be used to estimate an n × n coefficient matrix C = (c_ij) through

Ĉ = argmin_C ‖C‖₁ + (λ/2) ‖Y − YC‖²_F, subject to diag(C) = 0,

where Y = [y₁, …, y_n] is the p × n data matrix and λ > 0 tunes the sparsity,
and the sparsity of Ĉ ensures that only data points in the same linear subspace can have non-zero embedding coefficients. Afterward, we can define a subspace distance matrix D = (d_ij) as

d_ij = 1 − (|c̃_ij| + |c̃_ji|)/2, with c̃_ij = c_ij / max_l |c_il|,
where we follow Vidal (2011) in normalizing each row by its absolute maximum. We then use this distance matrix in our Bayesian distance clustering method. To assess the performance, we use the MNIST data of hand-written digits 0–9, with each image having p = 28 × 28 pixels. In each experiment, we take n = 10,000 random samples to fit the clustering models, with approximately 1,000 samples per digit, and we repeat each experiment 10 times. For comparison, we also run the near low-rank mixture model in the HDclassif package (Bergé et al., 2012) and spectral clustering based on the p-dimensional vector norm. Our method using subspace distances achieves clearly higher accuracy, as shown in Table 2.
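The embedding-then-distance pipeline can be sketched as follows; the l1 penalty `lam` and the particular symmetrized distance formula are illustrative assumptions, not the paper's exact tuning:

```python
import numpy as np
from sklearn.linear_model import Lasso

def subspace_distance(X, lam=0.05):
    """Sketch of sparse subspace embedding: regress each point on the others
    with an l1 penalty, then turn the coefficients into a distance matrix."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        mask = np.arange(n) != i
        # The l1 penalty makes x_i depend on only a few other points.
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X[mask].T, X[i])
        C[i, mask] = fit.coef_
    # Normalize each row by its absolute maximum, as in Vidal (2011).
    Cn = np.abs(C) / np.maximum(np.abs(C).max(axis=1, keepdims=True), 1e-12)
    # Symmetrize into a distance: small when i and j embed each other strongly.
    D = 1.0 - (Cn + Cn.T) / 2.0
    np.fill_diagonal(D, 0.0)
    return D

# Toy data: two noisy 1-d subspaces (lines) in R^3.
rng = np.random.default_rng(1)
t = rng.normal(size=(20, 1))
X = np.vstack([t @ np.array([[1.0, 0.0, 0.0]]),
               t @ np.array([[0.0, 1.0, 1.0]])])
X += 0.01 * rng.normal(size=X.shape)
D = subspace_distance(X)
print(D.shape)  # (40, 40)
```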
Table 2:
Accuracy of clustering MNIST hand-written digit data. The adjusted Rand index (ARI) is computed for the point estimates obtained by minimizing the variation of information; the average ARI and 95% confidence intervals are shown.
Bayes Dist. Clustering | Spectral Clustering | HDClassif
0.57 (0.54, 0.60)      | 0.50 (0.48, 0.52)   | 0.35 (0.31, 0.43)
Clustering brain regions
We carry out a data application to segment the mouse brain according to gene expression, using the Allen Mouse Brain Atlas dataset (Lein et al., 2007). Specifically, the data are in situ hybridization gene expression, represented by expression volume over spatial voxels. Each voxel is a (200 μm)³ cube. We take the mid-coronal section of 41 × 58 voxels; excluding the empty voxels outside the brain, this yields a sample size of n = 1,781. For each voxel, there are records of expression volume over 3,241 different genes. To avoid the curse of dimensionality for distances, we extract the first p = 30 principal components and use them as the source data. Since gene expression is closely related to the functionality of the brain, we use the clusters to represent a functional partitioning and compare them, in an unsupervised manner, with known anatomical regions. The voxels belong to 12 macroscopic anatomical regions (Table 4).
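The dimension-reduction step can be sketched with scikit-learn; the expression matrix below is random stand-in data (with a smaller gene dimension), not the Atlas values:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the voxel-by-gene expression matrix.
rng = np.random.default_rng(0)
expr = rng.normal(size=(1781, 100))

# Extract the first p = 30 principal components as the source data for distances.
pcs = PCA(n_components=30).fit_transform(expr)
print(pcs.shape)  # (1781, 30)
```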
Table 4:
Names and voxel counts in 12 macroscopic anatomical structures in the coronal section of the mouse brain. They represent the structural partitioning of the brain.
Anatomical Structure Name     | Voxel Count
Cortical plate                | 718
Striatum                      | 332
Thalamus                      | 295
Midbrain                      | 229
Basic cell groups and regions | 96
Pons                          | 56
Vermal regions                | 22
Pallidum                      | 14
Cortical subplate             | 6
Hemispheric regions           | 6
Cerebellum                    | 5
Cerebral cortex               | 2
For clustering, we use an over-fitted mixture with k = 20 and a small Dirichlet concentration parameter α = 1/20. As shown by Rousseau and Mengersen (2011), asymptotically a small α < 1 leads to the automatic emptying of small clusters; we observe such behavior in this large sample. In the Markov chain, most iterations have 7 major clusters. Table 5 lists the voxel counts in the point estimate.
Table 5:
Group indices and voxel counts in 7 clusters found by Bayesian Distance Clustering, using the gene expression volume over the coronal section of the mouse brain. They represent the functional partitioning of the brain.
Index | Voxel Count
1     | 626
2     | 373
3     | 176
4     | 113
5     | 79
6     | 39
7     | 12
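The automatic-emptying behavior of a small Dirichlet concentration can be illustrated directly; the 0.05 weight cutoff for a "major" cluster is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
k, alpha = 20, 1.0 / 20

# Draw symmetric Dirichlet weights: with alpha << 1, mass piles onto few components.
w = rng.dirichlet(np.full(k, alpha), size=2000)
n_major = (w > 0.05).sum(axis=1)  # clusters carrying non-negligible weight
print(n_major.mean())  # far fewer than k = 20 on average
```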
Comparing the two tables, although we do not expect a perfect match between the structural and functional partitions, we do see a correlation in the sizes of the top few groups. Indeed, when visualized on the spatial grid (Figure 5), the point estimates from Bayesian distance clustering bear a strong resemblance to the anatomical structure. By contrast, the clustering result from the Gaussian mixture model is completely different.
Figure 5:
Clustering the mouse brain using gene expression: visualizing the clustering results on the spatial grid of brain voxels. Compared with the anatomical structure (panel a), Bayesian distance clustering (panel c) shows higher similarity than the Gaussian mixture model (panel b). Most of the uncertainty (panel d) resides in the inner layers of the cortical plate (upper parts of the brain).
To benchmark against other distance clustering approaches, we compute various similarity scores and list the results in Table 3. Competing methods include spectral clustering (Ng et al., 2002), DBSCAN (Ester et al., 1996) and HDClassif (Bergé et al., 2012); the first two are applied to the same dimension-reduced data used by Bayesian distance clustering, while the last is applied directly to the high-dimensional data. Among all the methods, the point estimates of Bayesian distance clustering have the highest similarity to the anatomical structure.
Table 3:
Comparison of label point estimates using Bayesian distance clustering (BDC), Gaussian mixture model (GMM), spectral clustering, DBSCAN and HDClassif. The similarity measure is computed with respect to the anatomical structure labels.
                              | BDC  | GMM  | Spectral Clustering | DBSCAN | HDClassif
Adjusted Rand Index           | 0.49 | 0.31 | 0.45                | 0.43   | 0.43
Normalized Mutual Information | 0.51 | 0.42 | 0.46                | 0.44   | 0.47
Adjusted Mutual Information   | 0.51 | 0.42 | 0.47                | 0.45   | 0.47
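All three similarity scores are available in scikit-learn; a toy sketch with hypothetical labels standing in for the anatomical structure and the estimated clusters:

```python
from sklearn.metrics import (adjusted_mutual_info_score,
                             adjusted_rand_score,
                             normalized_mutual_info_score)

# Toy stand-ins for anatomical labels vs. estimated cluster labels.
anatomy = [0, 0, 0, 1, 1, 2, 2, 2]
cluster = [0, 0, 1, 1, 1, 2, 2, 0]

for name, score in [("ARI", adjusted_rand_score),
                    ("NMI", normalized_mutual_info_score),
                    ("AMI", adjusted_mutual_info_score)]:
    print(name, round(score(anatomy, cluster), 3))
```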
Figure 5(d) shows the uncertainty about the clustering point estimates, in terms of the posterior probability that each voxel's label differs from the point estimate. Besides the areas connecting neighboring regions, most of the uncertainty resides in the inner layers of the cortical plate (upper parts of the brain). As a result, the inner cortical plate can be clustered either with the outer layer or with the inner striatum region. From a practical perspective, when segmenting the brain according to functionality, it may therefore be more appropriate to treat the inner layers as a separate region.
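Given aligned posterior label samples (with label switching already resolved), this per-voxel uncertainty is simply the fraction of MCMC draws that disagree with the point estimate; a minimal sketch with hypothetical samples:

```python
import numpy as np

# Posterior label samples: rows are MCMC iterations, columns are data points (hypothetical).
samples = np.array([[0, 0, 1, 1],
                    [0, 0, 1, 0],
                    [0, 1, 1, 1],
                    [0, 0, 1, 1]])
point_estimate = np.array([0, 0, 1, 1])

# Per-point uncertainty: posterior probability of disagreeing with the point estimate.
uncertainty = (samples != point_estimate).mean(axis=0)
print(uncertainty)  # points 2 and 4 disagree in 25% of draws
```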
Discussion
The use of a distance likelihood reduces sensitivity to the choice of a mixture kernel, giving the ability to exploit distances for characterizing complex and structured data. While we avoid specifying the kernel, one potential weakness is that there can be sensitivity to the choice of the distance metric. For example, the Euclidean distance tends to produce more spherical clusters than the weighted Euclidean distance (see appendix). However, our results suggest that this sensitivity is often less than that to the assumed kernel. In many settings, there is a rich literature considering how to carefully choose the distance metric to reflect structure in the data (Pandit and Gupta, 2011). In such cases, the sensitivity of clustering results to the distance can be viewed as a positive. Any clustering method necessarily relies on some notion of distance between data points.

Another issue is that we give up the ability to characterize the distribution of the original data. An interesting solution is to consider a modular modeling strategy that connects the distance clustering to a post-clustering inference model while restricting the propagation of cluster information to one direction only. Related modular approaches have been shown to be much more robust than a single overarching full model (Jacob et al., 2017).

Our concentration characterization of the within-cluster distance based on the vector norm holds for arbitrary p.
On the other hand, high-dimensional clustering is a subtle topic with challenging issues: (i) not all coordinates of the data contain discriminative information favoring one particular partition, hence alternative distances (Vidal, 2011), feature selection (Witten and Tibshirani, 2010), or multi-view clustering (Duan, 2020) may be necessary; (ii) the selection of the number of clusters k becomes difficult, and it was recently discovered (Chandra et al., 2020) that the model-based framework can lead to the nonsensical result of converging to either one cluster or n clusters as p → ∞, even under a correctly specified model.

One interesting extension of this work is to combine it with a nearest-neighbor algorithm: we can ignore the large distances and instead model only the K smallest distances for each data point, which could also significantly reduce the computing and storage cost via sparse matrix computation. One possible model is to replace our Gamma distance density with a two-component mixture: one Gamma component concentrating near zero for modeling small distances, and one Uniform(0, d_max) component, with d_max the maximum distance, for handling large distances. Since the uniform density is a constant that does not depend on the specific value of the distance (as long as it is in the support of the uniform), it effectively eliminates the influence of large distances on clustering.
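A sketch of that two-component distance density; the mixture weight, Gamma shape and rate, and the uniform upper bound below are arbitrary illustrative values:

```python
import numpy as np
from scipy.stats import gamma, uniform

# Assumed hyperparameters for illustration only.
w, shape, rate, d_max = 0.9, 2.0, 10.0, 5.0

def distance_density(d):
    # Gamma component concentrates near zero; the uniform flattens large distances.
    return (w * gamma.pdf(d, a=shape, scale=1.0 / rate)
            + (1 - w) * uniform.pdf(d, loc=0.0, scale=d_max))

# Beyond the gamma's effective range the density is the flat uniform floor,
# so large distances contribute a constant and do not drive cluster assignments.
print(distance_density(3.0), distance_density(4.5))
```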
Lein, E. S., Hawrylycz, M. J., et al. (2007). Nature (published online 6 December 2006).