Ali Taylan Cemgil1. 1. Department of Computer Engineering, Boğaziçi University, 34342 Bebek, Istanbul, Turkey.
Abstract
We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.
We describe nonnegative matrix factorisation (NMF) with a Kullback-Leibler (KL) error measure in a statistical framework, with a hierarchical generative model consisting of an observation and a prior component. Omitting the prior leads to the standard KL-NMF algorithms as special cases, where maximum likelihood parameter estimation is carried out via the Expectation-Maximisation (EM) algorithm. Starting from this view, we develop full Bayesian inference via variational Bayes or Monte Carlo. Our construction retains conjugacy and enables us to develop more powerful models while retaining attractive features of standard NMF such as monotonic convergence and easy implementation. We illustrate our approach on model order selection and image reconstruction.
In machine learning, nonnegative matrix factorisation
(NMF) was introduced by Lee and Seung [1] as an alternative to k-means clustering and principal
component analysis (PCA) for data analysis and compression (also see [2]). In NMF, given a W × K nonnegative
matrix X = {x
}, where ν = 1 : W, τ = 1 : K, we seek positive matrices T and V such thatwhere i = 1 : I. We will refer to the W × I matrix T as the template matrix, and I × K matrix V the excitation matrix. The key property of NMF
is that T and V are constrained
to be positive matrices. This is in contrast with PCA, where there are no
positivity constraints or k-means clustering where each column of V is constrained
to be a unit vector. Subject to the positivity constraints, we seek a solution
to the following minimisation problem:Here, the function D is a suitably
chosen error function. One particular choice for D, on which we will focus here, is the information
(Kullback-Leibler) divergence, which we write asUsing Jensen's inequality
[3] and concavity of log x, it can be shown that D(·) is nonnegative
and D(X||Λ) = 0 if and only if X = Λ. The objective in (2) could be minimised by any
suitable optimisation algorithm. Lee and Seung [1] have proposed a very efficient variational bound
minimisation algorithm that has attractive convergence properties and which has
been successfully applied in various applications in signal analysis and source
separation, for example, [4-6].The interpretation of NMF, like singular value
decomposition (SVD), as a low rank matrix approximation is sufficient for the
derivation of a useful inference algorithm; yet this view arguably does not
provide the complete picture about assumptions underlying the statistical
properties of X. Therefore, we describe NMF from a statistical
perspective as a hierarchical model. In our framework, the original nonnegative
multiplicative update equations of NMF appear as an expectation-maximisation
(EM) algorithm for maximum likelihood estimation of a conditionally Poisson
model via data augmentation.
Starting from this view, we develop Bayesian extensions that facilitate more
powerful modelling and allow more sophisticated inference, such as Bayesian
model selection. Inference in the resulting models can be carried out easily
using variational (structured mean field) or Markov Chain Monte Carlo (Gibbs
sampler). The resulting algorithms outperform existing NMF strategies and open
up the way for a full Bayesian treatment for model selection via computation of
the marginal likelihoods (the evidence), such as estimating the dimensions of
the template matrix or regularising overcomplete representations via automatic
relevance determination.
2. The Statistical Perspective
The
interpretation of NMF as a low-rank matrix approximation is sufficient for the
derivation of an inference algorithm; yet this view arguably does not provide
the complete picture. In this section, we describe NMF from a statistical
perspective. This view will pave the way for developing extensions that
facilitate more realistic and flexible modelling as well as more sophisticated
inference, such as Bayesian model selection.Our first step is the derivation of the information
divergence error measure from a maximum likelihood principle. We consider the
following hierarchical model:
Here, 𝒫𝒪(s; λ) denotes the Poisson
distribution of the random variable s ∈ ℕ
0 with
nonnegative intensity parameter λ, whereand Γ(s + 1) = s! is the gamma
function. The priors p(T ∣ ·) and p(V ∣ ·) will be
specified later. We call the variables S
= {s
} latent sources. We can analytically
marginalise out the latent sources S = {S
1 ⋯ S
} to obtain the
marginal likelihoodThis result follows from the
well-known superposition property of Poisson random variables [7], namely, when s
∼ 𝒫𝒪(s
; λ
) and x = s
1 + s
2 + ⋯ + s
, then the
marginal probability is given by p(x) = 𝒫𝒪(x; ∑
λ
). The maximisation of this
objective in T and V is equivalent
to the minimisation of the information divergence in (3). In the derivation of
original NMF in [8],
this objective is stated first; the S variables are
introduced implicitly later during the optimisation on T and V. In the sequel, we show that this algorithm is
actually equivalent to EM, ignoring the priors p(T ∣ ·) and p(V ∣ ·).
2.1. Maximum Likelihood and the EM Algorithm
The
log-likelihood of the observed data X can be written
aswhere q(S) is an
instrumental distribution, that is arbitrary provided that the sum on the right
exists; q can only vanish
at a particular S only when p does so. Note
that this defines a lower bound to the log-likelihood. It can be shown via
functional derivatives and imposing the normalisation condition ∑
q(S) = 1 via Lagrange
multipliers that the lower bound is tight for the exact posterior of the latent
sources, that is,Hence the log-likelihood can be
maximised iteratively as follows:
Here, 〈f(x)〉 = ∫ p(x)f(x)dx, the expectation of some function f(x) with respect to p(x). In the E step, we compute the posterior distribution
of S. This defines a lower bound on the likelihoodFor many models in the
exponential family, which includes (5), the expectation on the right depends on
the sufficient statistics of q(S)( and is readily
available; in fact calculating q(S) should be
literally taken as calculating the sufficient statistics of q(S). The lower bound is readily obtained as a function of
these sufficient statistics, and maximisation in the M Step yields a fixed
point equation.
2.1.1. The E Step
To derive the
posterior of the latent sources, we observe thatFor the model in (5), we haveIt follows from (5), (12), (13),
and (7)where p
≡ t
v
/∑
t
v
are the cell
probabilities. Here, ℳ denotes a
multinomial distribution defined bywhere s = {s
1, s
2,…, s
}, p = {p
1, p
2,…, p
}, and p
1 + p
2 + ⋯ + p
= 1. Here, p
, i = 1 ⋯ I are the cell
probabilities, and x is the index
parameter where s
1 + s
2 + ⋯ + s
= x. The Kronecker delta function is defined by δ(x) = 1 when x = 0, and δ(x) = 0 otherwise. It
is a standard result that the marginal mean isthat is, the expected value
of each source s
is a fraction
of the observation, where the fraction is given by the corresponding cell
probability.
2.1.2. The M Step
It is indeed a
good news that the posterior has an analytic form. Since
now the M step can be calculated easily as follows:Fortunately, for maximisation
with respect to T and V, the last two difficult terms are merely constant,
and we need only to maximise the simpler objectivewhere we only need the expected
value of the sources given by the previous values of the templates and
excitations:Maximisation of the objective Q and substituting 〈s
〉( give the
following fixed point equations: Equation (20) is
identical to the multiplicative update rules of [8]. However, our derivation
via data augmentation obtains the same result as an EM algorithm. It is interesting
to note that in literature, NMF is often described as EM-like; here, we show
that it is actually just an EM algorithm. We see that the efficiency of NMF is
due to the fact that the W × I × K object 〈S〉 needs not to be
explicitly calculated as we only need its marginal statistics (sums across τ or ν).We note that our model is valid when X is integer
valued. See [9] for a
detailed discussion about consequences of this issue. Here, we assume that for
nonnegative real valued , we only consider the integer part, that is, we let , where E is a noise
matrix with entries uniformly drawn in [0, 1). In practice, this is not an obstacle when the
entries of X are large.The interpretation of NMF as a maximum likelihood
method in a Poisson model is mentioned in the original NMF paper [1] and discussed in more
detail by [5, 10]. The equivalence of NMF and
probabilistic latent sematic analysis is shown in [11]. Kameoka in [5] focuses on the optimisation
and gives an equivalent description using auxiliary function maximisation. In
contrast, the auxiliary variables can be viewed as model variables (the sources s) that are
analytically integrated out [10]. A general framework is described in [12]. Prior structures are
placed on conditionally Gaussian NMF models to enforce sparsity in [13]. However, all of these
approaches are based on regularisation, that is, aim at calculating a maximum a
posteriori estimate. In contrast, we provided in this article a full Bayesian
treatment where the templates and excitations are integrated out.
2.2. Hierarchical Prior Structure
Given the
probabilistic interpretation, it is possible to propose various hierarchical
prior structures to fit the requirements of an application. Here we will
describe a simple choice where we have a conjugate prior as follows:Here, 𝒢 denotes the
density of a gamma random variable x ∈ ℝ+ with shape a ∈ ℝ+ and scale b ∈ ℝ+ defined byThe primary motivation for
choosing a Gamma distribution is computational convenience: Gamma distribution
is the conjugate prior to Poisson intensity. The indexing highlights the most
general case where there are individual parameters for each element t
and v
. Typically, we do not allow many free hyperparameters
but tie them depending upon the requirements of an application. See Figure 1
for an example. As an example, consider a model where we tie the
hyperparameters such as a
= a
, b
= b
, a
= a
, and b
= b
for i = 1 ⋯ I, ν = 1 ⋯ W, and τ = 1 ⋯ K. This model is simple to interpret, where each
component of the templates and the excitations is drawn independently from the
Gamma family shown in Figure 2. Qualitatively, the shape parameter a controls the sparsity of the representation. Remember
that 𝒢(x; a, b/a) has the mean b and standard
deviation . Hence, for large a, all coefficients will have more or less the same
magnitude b, and typical representations will be full. In
contrast, for small a, most of the coefficients will be very close to zero,
and only very few will be dominating, hence favouring a sparse representation.
The scale parameter b is adapted to
give the expected magnitude of each component.
Figure 1
(a) A schematic description of the NMF model with data augmentation.
(b) Graphical model with hyperparameters. Each source element s
is Poisson
distributed with intensity t
v
. The observations are given by x
= ∑
s
. In matrix notation, we write X = ∑ S
. We can analytically integrate out over S. Due to superposition property of Poisson
distribution, intensities add up, and we obtain 〈X〉 = TV. Given X, the NMF algorithm is shown to seek the maximum
likelihood estimates of the templates T and excitations V. In our Bayesian treatment, we further assume that
elements of T and V are Gamma
distributed with hyperparameters Θ.
Figure 2
(Left) The family of
densities p(v; a, b = 1) = 𝒢(v; a, b/a) with the same
mean 〈v〉 = b = 1. Small values of a (for a < 1) enforce
sparser representations, and large values of a ≈ 100 tie all values
to be close to a nonzero mean (nonsparse representation).
To model missing data, that is, when some of the x
are not
observed, we define a mask matrix M = {m
}, the same size as X where m
= 0, if x
is missing and 1 otherwise (see
Appendix A.4 for details). Using the mask variables, the observation model with
missing data can be written asThe hierarchical model in (21) is more powerful than
the basic model of (5), in that it allows a lot of freedom for more realistic
modelling. First of all, the hyperparameters can be estimated from examples of
a certain class of source to capture the invariant features. Another
possibility is Bayesian model selection, where we can compare alternative
models in terms of their marginal likelihood. This enables one to estimate the
model order, for example, the optimum number of templates to represent a
source.
3. Full Bayesian Inference
Below, we describe various interesting problems that
can be cast to Bayesian inference problems. In signal analysis and feature
extraction with NMF, we may wish to calculate the posterior distribution of templates and
excitations, given data and hyperparameters Θ ≡ (Θ, Θ). Another important quantity is the marginal
likelihood (also known as the evidence), whereThe marginal likelihood can be
used to estimate the hyperparameters, given examples of a source classor to compare two given models
via Bayes factorsThis latter quantity is particularly useful for
comparing different classes of models. Unfortunately, the integrations required
cannot be computed in closed form. In the sequel, we will describe the Gibbs
sampler and variational Bayes as approximate inference strategies.
3.1. Variational Bayes
We sketch here
the Variational Bayes (VB) [3, 14] method to bound the marginal log-likelihood aswhere q = q(S, T, V) is an
instrumental distribution, and H[q] is its entropy.
The bound is tight for the exact posterior q(S, T, V) = p(S, T, V ∣ X, Θ) but as this
distribution is complex, we assume a factorised form for the instrumental
distribution by ignoring some of the couplings present in the exact posterior
as follows:where ∈ 𝒞 = {{S}, {T}, {V}} denotes set of
disjoint clusters. Hence, we are no longer guaranteed to attain the exact
marginal likelihood ℒ
(Θ). Yet, the bound property is preserved, and the
strategy of VB is to optimise the bound. Although the best q distribution
respecting the factorisation is not available in closed form, it turns out that
a local optimum can be attained by the following fixed point iteration:where q
= q/q
. This iteration monotonically improves the individual
factors of the q distribution,
that is, ℬ[q
(] ≤ ℬ[q
(] for n = 1, 2,… given an
initialisation q
(0). The order is not important for convergence; one
could visit blocks in arbitrary order. However, in general, the attained fixed
point depends upon the order of the updates as well as the starting point q
(0)(·). We choose the following update order in our
derivations:
3.2. Variational Update Equations and Sufficient Statistics
The
expectations of 〈log p(X, S, T, V ∣ Θ)〉 are functions
of the sufficient statistics of q (see the
expression in the Appendix A.2). The update equation for the latent sources
(30) leads to the following: These equations are analogous to
the multinomial posterior of EM given in (14); only the computation of cell
probabilities is different. The excitation and template distributions and their
sufficient statistics follow from the properties of the gamma distribution:
3.3. Efficient Implementation
One of the
attractive features of NMF is easy and efficient implementation. In this
section, we derive that the update equations of Section 3.2 in compact matrix
notation are to illustrate that these attractive properties are retained for
the full Bayesian treatment. A subtle but key point in the efficiency of the
algorithm is that we can avoid explicitly storing and computing the W × I × K object 〈S〉, as we only need the marginal statistics during
optimisation. Consider (33). We can writeHere, the denominator has to be nonzero. In the last line, we
have represented the expression in compact notation where we define the
following matrices:
The matrices
subscripted with t are in ℝ +
and with v are in ℝ +
. For notational convenience, we define. ∗ and ./ as elementwise
matrix multiplication and division, respectively, and 1
as a W × 1 vector of ones.
After straightforward substitutions, we obtain the variational nonnegative matrix factorisation algorithm, that can compactly be expressed as in panel Algorithm 1.
Algorithm 1
Variational
nonnegative matrix factorisation.
Similarly, an iterative conditional modes (ICM)
algorithm can be derived to compute the maximum a posteriori (MAP) solution
(see Appendix A.4):
Note that when the shape parameters
go to zero, that is, A
, A
→ 0, we obtain the maximum likelihood NMF algorithm.
3.4. Markov Chain Monte Carlo, the Gibbs Sampler
Monte Carlo
methods [15, 16] are powerful computational
techniques to estimate expectations of formwhere x
( are independent
samples drawn from p(x). Under mild conditions on f, the estimate converges to
the true expectation for N → ∞. The difficulty here is obtaining independent samples {x
(} from
complicated distributions.The Markov Chain Monte Carlo (MCMC) techniques
generate subsequent samples from a Markov chain defined by a transition kernel
𝒯, that is, one generates x
( conditioned on x
( as follows:Note that the transition kernel 𝒯 is not needed
explicitly in practice; all is needed is a procedure to sample a new configuration, given the previous one. Perhaps
surprisingly, even though subsequent samples are correlated, provided that 𝒯 satisfies
certain ergodicity conditions, (39) remains still valid, and estimated
expectations converge to their true values when number of samples N goes to
infinity [15]. To
design a transition kernel 𝒯 such that the
desired distribution is the stationary distribution, that is, p(x) = ∫ d
x′𝒯(x ∣ x′)p(x′), many alternative strategies can be employed
[16]. One particularly
convenient and simple procedure is the Gibbs sampler where one samples each
block of variables from full
conditional distributions. For the NMF model, a possible Gibbs sampler is Note that this procedure
implicitly defines a transition kernel 𝒯(· ∣ ·). It can be shown [15] that the stationary distribution of 𝒯 is the exact
posterior p(S, T, V ∣ X, Θ). Eventually, the Gibbs sampler converges regardless
of the order that the blocks are visited, provided that each block is visited
infinitely often in the limit n → ∞. However, the rate of convergence is very difficult
to assess as it depends upon the order of the updates as well as the starting
configuration (T
(0), V
(0), S
(0)). It is instructive to contrast above (41) with
the variational update of (30)–(32): algorithmically the two approaches are
quite similar. The pseudo-code is given in Algorithm 2.
Algorithm 2
Gibbs sampler for nonnegative matrix factorisation.
3.4.1. Marginal Likelihood Estimation with Chib's Method
The marginal
likelihood can be estimated from the samples generated by the Gibbs sampler
using a method proposed by Chib [17]. Suppose we have run the block Gibbs sampler until
convergence and have N samples as
follows:The marginal likelihood is
(omitting hyperparameters Θ)This equation holds for all points (V, T, S). We choose a point in the configuration space;
provided that the distribution is unimodal, a good candidate is a configuration
near the mode . The numerator in (43) is easy to evaluate. The
denominator isThe first term is full
conditional, so it is available for the Gibbs sampler. The third term isThe second term is trickierThe first term here is full
conditional. However, the original Gibbs run gives us only samples from p(V ∣ X), not . The idea is to run the Gibbs sampler for M further
iterations where we sample from , that is, with S clamped at . The resulting estimate is Chib's method estimates the
marginal likelihood as follows:
4. Simulations
Our goal is to
illustrate our approach in a model selection context. We first illustrate that
the variational approximation to the marginal likelihood is close to the one
obtained from the Gibbs sampler via Chib's method. Then, we compare the quality
of solutions we obtain via Variational NMF and compare them to the original NMF
on a prediction task. Finally, we focus on reconstruction quality in the
overcomplete case where the standard NMF is subject to overfitting.
Model Order Determination
To test our approach, we generate synthetic
data from the hierarchical model in (21) with W = 16, K = 10, and the number of sources I
true = 5. The inference task is to find the correct number of
sources, given X. The hyperparameters of the true model are set to a
= a
= 10, b
= b
= 1, a
= a
= 1, and b
= b
= 100. In the first experiment, the hyperparameters are
assumed to be known and in the second are jointly estimated from data, using
hyperparameter adaptation. We evaluate the marginal likelihood for models with
the number of templates I = 1 ⋯ 10, with the Gibbs sampler using Chib's method and
variational lower bound ℬ via variational
Bayes. We run the Gibbs sampler for MAXITER = 10 000 steps following
a burn-in period of 5000 steps; then we
clamp the sources S and continue
the simulation for a further 10 000 steps to
estimate quantities required by Chib's method. We run the variational algorithm
until convergence of the bound or 10 000 iterations,
whichever occurs first. In Figure 3(a), we show a comparison of the
variational estimate with the average of 5 independent
runs obtained via Chib's method. We observe, that both methods give consistent
results. In Figure 4, we show the lower bound as a function of model order I, where for each I, the bound is optimised independently by jointly
optimising hyperparameters a
, b
, a
, and b
using the
equations derived in the appendix. We observe, that the correct model order can
be inferred even when the hyperparameters are unknown a priori. This is
potentially useful for estimation of model order from real data.
Figure 3
Model selection results. (a) Comparison of
model selection by variational bound (squares) and marginal likelihood
estimated by Chib's (circles) method. The hyperparameters are assumed to be
known. (b) Box-plot of marginal likelihood estimated by Chib's method
using 5000, 10 000, and 10 000 iterations for
burn-in, free, and clamped sampling. The boxes show the lower quartile, median,
and upper quartile values. (c) Model selection by variational bound when
hyperparameters are unknown and jointly estimated.
Figure 4
Model
selection using variational bound with adapted hyperparameters on face data 16 × 16 with I* = 27 (a) and 32 × 32 with I* = 42 (b).
As real data, we use a version of the Olivetti face
image database (K = 400 images of 64 × 64 pixels
available at http://www.cs.toronto.edu/). We
further downsampled to 16 × 16 or 32 × 32 pixels, hence
our data matrix X is 162 × 400 or 322 × 400. We use a model with tied hyperparameters as a
= a
, b
= b
, a
= a
, and b
= b
, where all hyperparameters are jointly estimated. In
Figure 4, bottom, we show results of model order determination for this dataset
with joint hyperparameter adaptation. Here, we run the variational algorithm
for each model order I = 1 ⋯ 100 independently
and evaluate the lower bound after optimising the hyperparameters. The Gibbs
sampler is not found practical and is omitted here. The lower bound behaves as
is expected from marginal likelihood, reflecting the tradeoff between too many
and too few templates. Higher resolution implies more templates, consistent
with our intuition that detail requires more templates for accurate
representation.We also investigate the nature of the representations
(see Figure 5). Here, for each independent run, we fix the values of shape
parameters to (a
, a
) = [(10, 10), (0.1, 0.1), (10, 0.2), (10, 0.5)] and only
estimate b
and b
. This corresponds to enforcing sparse or nonsparse t and v. Each column shows I = 36 templates
estimated from the dataset conditioned on hyperparameters. The middle image is
the same template image above weighted with the excitations corresponding to
the reconstruction (the expected value of the predictive distribution) below.
Here, we clearly see the effect of the hyperparameters. In the first condition (a
, a
) = (10, 10), the prior does not enforces sparsity to the
templates and excitations. Hence, for the representation of a given image,
there are many active templates. In the second condition, we try to force both
matrices to be sparse with (a
, a
) = (0.1, 0.1). Here, the
result is not satisfactory as isolated components of the templates are zeroed,
giving a representation that looks like one contaminated by “salt-and-pepper”
noise. The third condition ((a
, a
) = (10, 0.2)) forces only the excitations to be sparse. Here, we
observe that the templates correspond to some average face images.
Qualitatively, each image is reconstructed using a superposition of a few of
these templates. In the final representation, we enforce sparsity in the
templates but not in the excitations. Here, our estimate finds templates that
correspond to parts of individual face images (eyebrows, lips, etc.). This
solution, intuitively corresponding to a parsimonious representation, also is
the best in terms of the marginal likelihood. With proper initialisation, our
variational procedure is able to find such solutions.
Figure 5
Templates, excitations
for a particular example, and the reconstructions obtained for different
hyperparameter settings. B is the lower bound for the whole dataset.
Prediction
We now compare variational Bayesian NMF with
the maximum likelihood NMF on a missing data prediction task.To illustrate the self regularisation effect, we set
up an experiment in which we select a subset of the face data consisting of 50 images. From
half of the images, we remove the same patch (Figure 6) and predict the missing
pixels. This is a rather small dataset for this task, as we have only 10 images for each
of the 5 different
persons, and half of these images have missing data at the same spot. We
measure the quality of the prediction in terms of signal-to-noise ratio (SNR).
The missing values are reconstructed using the mean of the predictive
distribution X
pred ≡ 〈X〉 = T*V*, where T* and V* are point
estimates of the template and excitation matrix. We compare our variational
algorithm with the classical NMF. For each algorithm, we test two different versions.
The variational algorithms differ in how we estimate T* and V*. In the first variational algorithm, we use a crude
estimate of T* and V* as the mean of
the approximating q distribution.
In the second condition, after convergence of hyperparameters via VB, we
reinitialise T and V randomly and
switch to an ICM algorithm (see (38)). This strategy finds a local mode (T*, V*) of the exact
posterior distribution. In NMF, we test two initialisation strategies: in the
first condition, we initialise the templates randomly; in the second, we set
them equal to the images in the dataset with random perturbations.
Figure 6
Results of a typical run. (a) Example images from the dataset.
(b) Comparison of the reconstruction accuracy of different methods in
terms of SNR (in dB), organised according to the sparseness of the solution.
(c) (from left to right). The ground truth, data with missing pixels. The
reconstructions of VB, VB + ICM, and ML-NMF with two initialisation strategies
(1 = random, 2 = to image).
In Figure 6, we show the reconstruction results for a
typical run, for a model with I = 100 templates. Note
that this is an overcomplete model, with twice as many templates as there are
images. To characterise the nature of the estimated template and excitation
matrices, we use the sparseness criteria [18] of an m × n matrix X, defined as . This measure is 1 when the matrix X has only a
single nonzero entry and 0 when all
entries are equal. We see that the variational algorithms are superior in this
case in terms of SNR as well as the visual quality of the reconstruction. This
is perhaps not surprising, since with maximum likelihood estimation; if the
model order is not carefully chosen, generalisation performance is poor: the
“noise” in the observed data is fitted but the prediction quality drops on
new data. An interesting observation is that highly sparse solutions (either in
templates or excitations) do not give the best result; the solution that
balances both seems to be the best in this setting. This example illustrates
that sparseness in itself may not be necessarily a good criteria to optimise;
for model selection, the marginal likelihood should be used as the natural
quantity.On the same face dataset, we compare the prediction
error in terms of the SNR for varying model order I. Our goal is to compare the prediction performance of
the full Bayesian approach with the ML-NMF for a range of conditions
(under-complete, complete, and overcomplete). The results shown in Figure 7 are
averages of several runs with hyperparameter adaptation and different
hyperparameter tying. In the simulations, the shape parameters are tied always
as a
= a
(and a
= a
). The scale
parameters are untied or tied as (b
, b
) (across
sources) or b
, b
(different for
each source) and jointly optimised. Regardless of the hyperparameter tying
structure, the results were quite similar. The best SNR values are attained
with untied scale parameters for both excitations and templates.
Figure 7
Average SNR results for
model orders I = 1, 25, 50, 75, 100 covering
undercomplete, complete, and overcomplete cases. Comparison of VB, VB + ICM, and
ML-NMF with two initialisation strategies (1 = random, 2 = to image).
We observe that, due to the implicit
self-regularisation in the Bayesian approach, the prediction performance is not
very sensitive to the model order and is immune to overfitting. In contrast,
the ML-NMF with random initialisation is prone to overfitting, and prediction
performance drops with increasing model order. Interestingly, when we
initialise the ML-NMF algorithm to true data points with small perturbations,
the prediction performance in terms of SNR improves. Note that this strategy
would not be possible for data where the pixels were truly missing. However,
visual inspection shows that the interpolation can still be “patchy” (see
Figure 6).We observe that hyperparameter adaptation is crucial
for obtaining good prediction performance. In our simulations, results for VB
without hyperparameter adaptation were occasionally poorer than the ML
estimates. Good initialisation of the shape hyperparameters seems to be also
important. We obtain best results when initialising the shape hyperparameters
asymmetrically, for example, a
< 1 and a
> 10 (see 3rd and
4th panels from left in Figure 5). When the shape hyper-parameters are initialised
to small a
, a
≪ 1, the EM seems to get stuck in a local minima more
often. Consequently, the prediction results are poorer. We have also carried
out tests with more undercomplete representations when the model order is low I < 10. For these simulations, while the marginal likelihood
was in favour of the VB solutions, we have not observed statistically
significant differences between VB and ML in terms of SNR. The SNR improvement
of VB over ML was on average about 0.1 dB only.
5. Discussion and Conclusions
In this paper, we have investigated KL-NMF from a
statistical perspective. We have shown that KL minimisation formulation the
original algorithm can be derived from a probabilistic model where the
observations are superposition of I independent
Poisson-distributed latent sources. Here, the template and excitation matrices
turn out to be latent intensity parameters. The interpretation of NMF as a
maximum likelihood method in a Poisson model is mentioned in the original NMF
paper [1] and
discussed in more detail by [5, 10], and [5] focuses on the optimisation and gives an equivalent
description using auxiliary function maximisation. In contrast, [10] illustrates that the
auxiliary variables can be viewed as model variables (the sources s) that are
analytically integrated out. The relationship between KL divergence and the
Poisson distribution is not just a lucky coincidence. There exists a duality
between divergence functions and exponential family distributions. If a cost
function is a Bregman divergence, there exists a regular exponential family
where minimising the cost corresponds to maximum likelihood parameter
estimation [19]; also
see [12] it in the
context of matrix factorisation models.The novel observation in the current article is the
exact characterisation of the approximating distribution q(S) or full
conditionals p(S ∣ T, V, X) as a product of
multinomial distributions, leading to a richer approximation distribution than
a naive mean field or single site Gibbs (which would freeze due to
deterministic p(X ∣ S)). This
conjugate form leads to significant simplifications in full Bayesian
integration. Apart from the conditionally Gaussian case, NMF with KL objective
seems to be unique in this respect. For several other distance metrics D(·||·), we find that full Bayesian inference not as
practical as p(S ∣ T, V, X) is not
standard.We have also shown that the standard KL-NMF algorithm
with multiplicative update rules is in fact an EM algorithm with data
augmentation. Extending upon this observation, we have developed an
hierarchical model with conjugate Gamma priors. We have developed a variational
Bayes algorithm and a Gibbs sampler for inference in this hierarchical model.
We have also developed methods for estimating the marginal likelihood for model
selection. This is an additional feature that is lacking in existing NMF
approaches with regularisation, where only MAP estimates are obtained, such as
[13, 18, 20].Our simulations suggest that the variational bound
seems to be a reasonable approximation to the marginal likelihood and can guide
model selection for NMF. The computational requirements are comparable to the
ML-NMF. A potentially time-consuming step in the implementation of the
variational algorithm is the evaluation of the Ψ function but
this step can also be replaced by a simple piecewise polynomial approximation
since exp(Ψ(x)) ≈ x − 0.5 for x > 5.We first compare the variational inference with a
Gibbs sampler. In our simulations, we observe that both algorithms give
qualitatively very similar results, both for inference of templates and
excitations as well as model order selection. We find the variational approach
somewhat more practical as it can be expressed as simple matrix operations,
where both the fixed point equations as well as the bound can be compactly and
efficiently implemented using matrix computation software. In contrast, our
Gibbs sampler is computationally more demanding, and the calculation of
marginal likelihood is somewhat more tricky. With our implementation of both
algorithms, the variational method is faster by a factor of around 13. Reference implementations of both algorithms in
Matlab are available from the following url: http://www.cmpe.boun.edu.tr/.In terms of computational requirements, the
variational procedure has several advantages. First, we circumvent sampling
from multinomial variables, which is the main computational bottleneck with the
Gibbs sampler. Whilst efficient algorithms are developed for multinomial
sampling [21], the
procedure is time consuming when the number of latent sources I is large. In
contrast, the variational method estimates the expected sufficient statistics
directly by elementary matrix operations. Another advantage is hyperparameter
estimation. In principle, it is possible to maximise the marginal likelihood
via a Monte Carlo EM procedure [22, 23]; yet this potentially requires many more iterations.
In contrast, the evaluation of the derivatives of the lower bound is
straightforward and can be implemented without much additional computational
cost.The efficiency of the Gibbs sampler could be improved
by working out the distribution of the sufficient statistics of sources
directly (namely, quantities ∑
s
or ∑
s
) to circumvent
multinomial sampling. Unfortunately, for the sum of binomial random variables
with different cell probability parameters, the sum does not have a simple form
but various approximations are possible [24].Inference based on VB is easy to implement but at the
end of the day, the fixed point iteration is just a gradient-based lower bound
optimisation procedure, and second order Newton methods can provide more
efficient alternatives. For NMF models, there exist many conditional
independence relations, hence the Hessian matrix has a special block structure
[12]. It is certainly
interesting to develop efficient inference methods that make use of the special
block structure of the Hessian matrix. However, as our primary goal was a
practical full Bayesian treatment, we have not investigated this path yet.
Another approach in this direction is using alternative deterministic
integration techniques such as expectation propagation (EP) [25]. Those techniques work
directly on an approximation of the true marginal likelihood rather than a
bound. A related approach known as expectation consistent (EC) inference is
used with success in related source separation problems [26].From a modelling perspective, our hierarchical model
provides some attractive properties. It is easy to incorporate prior knowledge
about individual latent sources via hyperparameters, and one can easily capture
variability in the templates and excitations that is potentially useful for
developing robust techniques. The prior structure here is qualitatively similar
to an entropic prior [20, 27], and we find qualitatively similar representations to
the ones found by NMF reported earlier by [1, 18]. However, none of the above mentioned methods provide
an estimate of the marginal likelihood, which is useful for model selection.
Our generative model formulation can be extended in various ways to suit the
specific needs of particular applications. For example, one can enforce more
structured prior models such as chains or fields [10]. As a second possibility,
the Poisson observation model can be replaced with other models such as clipped
Gaussian, Gamma, or Gaussians which lead to alternative source separation
algorithms. For example, the case of Gaussian sources where the excitations and
templates correspond to the variances is discussed in [28].Our main contribution here is the development of a
principled and practical way to estimate both the optimal sparsity criteria and
model order, in terms of marginal likelihood. By maximising the bound on
marginal likelihood, we have a method where all the hyperparameters can be
estimated from data, and the appropriate sparseness criteria is found
automatically. We believe that our approach provides a practical improvement to
the highly popular KL-NMF algorithm without incurring much additional
computational cost.
Authors: Antonia Godoy-Lorite; Roger Guimerà; Cristopher Moore; Marta Sales-Pardo Journal: Proc Natl Acad Sci U S A Date: 2016-11-23 Impact factor: 11.205
Authors: Christopher Quince; Tom O Delmont; Sébastien Raguideau; Johannes Alneberg; Aaron E Darling; Gavin Collins; A Murat Eren Journal: Genome Biol Date: 2017-09-21 Impact factor: 13.583