Melanie Nijs1, Tina Smets1, Etienne Waelkens2, Bart De Moor1. 1. STADIUS Center for Dynamical Systems, Signal Processing, and Data Analytics, Department of Electrical Engineering (ESAT), KU Leuven, Leuven, Belgium. 2. Department of Cellular and Molecular Medicine, KU Leuven Campus Gasthuisberg O&N 2, Leuven, Belgium.
Abstract
RATIONALE: Non-negative matrix factorization (NMF) has been used extensively for the analysis of mass spectrometry imaging (MSI) data, visualizing simultaneously the spatial and spectral distributions present in a slice of tissue. The statistical framework offers two related NMF methods: probabilistic latent semantic analysis (PLSA) and latent Dirichlet allocation (LDA), the latter being a generative model. This work offers a mathematical comparison between NMF, PLSA, and LDA, and includes a detailed evaluation of Kullback-Leibler NMF (KL-NMF) for MSI for the first time. We inspect the results for MSI data analysis, as these different mathematical approaches impose different characteristics on the data and the resulting decomposition. METHODS: The four methods (NMF, KL-NMF, PLSA, and LDA) are compared on seven different samples: three originated from mice pancreas and four from human-lymph-node tissues, all obtained using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). RESULTS: While matrix factorization methods are often used for the analysis of MSI data, we find that each method has different implications for the exactness and interpretability of the results. We have obtained promising results using KL-NMF, which has only rarely been used for MSI so far, improving on both NMF and PLSA, and have shown that the hitherto stated equivalent KL-NMF and PLSA algorithms do differ in the case of MSI data analysis. LDA, assumed to be the better method in the field of text mining, is shown to be outperformed by PLSA in the setting of MALDI-MSI. Additionally, the molecular results of the human-lymph-node data have been thoroughly analyzed for a better assessment of the methods under investigation. CONCLUSIONS: We present an in-depth comparison of multiple NMF-related factorization methods for MSI.
We aim to provide fellow researchers in the field of MSI with a clear understanding of the mathematical implications of using each of these analytical techniques, which might affect the exactness and interpretation of the results.
Mass Spectrometry Imaging (MSI) is an exploratory research technique aiming to visualize the spatial distributions of biomolecules present in a slice of tissue.
Well‐known MSI analysis techniques include clustering and dimensionality reduction methods, which comprise, amongst others, factorization methods and non‐linear dimensionality reduction (NLDR) techniques. Clustering, such as k‐means,
identifies similar pixels based on the distance between their spectra. NLDR methods, such as t‐distributed stochastic neighbour embedding (t‐SNE)
and uniform manifold approximation and projection (UMAP),
map the dataset, assumed to lie on a non‐linear manifold, to a low‐dimensional embedding. In contrast to these techniques, factorization methods reduce the traditionally very large MSI matrices to a product of two low‐rank matrices, containing the spectral and spatial distributions of the data. Each column‐row combination of this product can be seen as an individual component describing an important section of the data. Because of the simultaneous extraction of the spectral and spatial elements of MSI data and their straightforward interpretability, factorization methods have proven extremely useful over the last decade.

A well‐known example of matrix factorization is principal component analysis (PCA).
PCA projects the data orthogonally and linearly onto a number K of orthogonal principal components, i.e., the eigenvectors corresponding to the K largest eigenvalues of the matrix. The data is factored into a score matrix and a loadings matrix. The score matrix, which maximizes the variation along each principal component, tries to capture as much of the data as possible in the reduced rank‐K subspace. The loadings matrix describes the weights of the original data elements in the new space. PCA is an unconstrained factorization method and does not take into account the non‐negative nature of MSI data. This can be resolved by using non‐negative matrix factorization (NMF) instead.

NMF, as shown in Figure 1, factorizes the matrix into a spatial and a spectral component by minimization of the Frobenius norm whilst constraining all values to be greater than or equal to zero. NMF has been used extensively for the analysis of MSI data, and extensions and variations of this algorithm are still actively being researched.
Besides non‐negativity, MSI data is known to follow a Poisson distribution because of the counting nature of the mass spectrometer. With MALDI‐MSI, the tissue under investigation is ionized by means of a laser beam, after which the ions are accelerated through the mass spectrometer until they reach a sensor that measures the number of hits as well as the time between ionization and reaching the sensor. Due to the heterogeneous distribution of molecules (proteins) within the tissue, the counts (or intensity) measured for each m/z ratio can be seen as a random number of counts, allowing the assumption that MSI data is Poisson distributed, as previously claimed by Harn et al and Piehowski et al.
Standard NMF minimizes the squared error assuming a Gaussian distribution of the noise, but Lee and Seung's 1999 paper on NMF already proposed the use of Kullback–Leibler divergence (KL‐NMF) instead of least squares as error measure to incorporate Poisson noise for the factorization of face images.
Another approach to incorporating the Poisson prior into NMF can be found in the statistical framework: probabilistic latent semantic analysis (PLSA), or indexing (PLSI), is a topic modelling technique often used for text mining. Here, a word count matrix of documents, a co‐occurrence matrix, is factorized into K topics, each with a document‐to‐topic distribution and a word‐to‐topic distribution. It has been shown that maximizing the PLSA likelihood function yields the same objective as minimizing the KL divergence error. Under some conditions the KL‐NMF and PLSA methods are equivalent, but they generally converge to different optima.
In text mining, it is shown that latent Dirichlet allocation (LDA) outperforms PLSA in terms of disambiguation and precision due to the sparse Dirichlet prior.
LDA is generally not categorized as a factorization method but as a generative statistical model closely related to PLSA and thus indirectly connected to NMF.
FIGURE 1
Graphical representation of non‐negative matrix factorization (NMF)
Besides being non‐negative and Poisson distributed, we also know that MSI data can be modelled as non‐linear. Methods such as t‐SNE and UMAP perform a non‐linear mapping of the data onto a lower‐dimensional structure and provide very intuitive visualizations because they take the entire feature space into account. One notable drawback of these methods is their inability to simultaneously provide a spatial and a spectral distribution of the data, making them harder to interpret than linear factorization methods.

In this paper, we will compare the results of the following methods: NMF, KL‐NMF, PLSA, and LDA for the application on MSI data analysis. Although there are many more variants of NMF, we limit ourselves to these four because there is a clear resemblance in what they try to obtain, but their mathematics imply fairly different results. All but KL‐NMF have already been investigated for the analysis of MSI data.
KL‐NMF has been used in the context of MSI in previous work, but has not yet been examined in detail or evaluated against standard NMF.
This is likely due to the theoretical resemblance with PLSA, which has also been stated specifically for MSI.
However, it has already been proven that both methods differ practically, and the theoretical resemblance only holds under certain circumstances.
In general, the selection of a certain algorithm for analysis is rarely substantiated, nor are the implications of a certain framework mentioned. Because MSI data analysis is an exploratory technique, it is important to keep in mind the deviations resulting from the mathematical framework of the chosen algorithm. By comparing the results of different approaches, the correctness of results can be assessed with more certainty. Some frequently used numerical performance measures are the reconstruction error, computation time, and memory usage. Comparison of the non‐linear mapping with the spatial distributions obtained with factorization methods could serve as an additional performance measure. As UMAP was found to be excellent for MSI data analysis,
we will examine whether the factorization methods are able to capture the same regions as UMAP. Furthermore, the mathematical complexity, interpretability, and scalability will also be used as evaluation criteria.

The next sections provide a mathematical description of the different algorithms, an experimental comparison, and a discussion of the different methods. With this paper we hope to offer a comprehensive overview of the NMF‐related methods mentioned and to evaluate their performance on MSI data analysis by means of multiple measures, including KL‐NMF, which has not been compared in this setting before. We aim to provide a clear comparison and understanding of each method, listing their advantages and disadvantages. We will show that KL‐NMF offers the best analysis for our data when trying to match the distributions obtained with UMAP.
METHODS
Define X ∈ ℝ^{M×N}, with non‐negative entries, an MSI dataset consisting of M rows, describing the measured m/z values, and N columns, describing all the (vectorized) pixels in the sample. We want to extract K non‐negative components that represent the data as well as possible, according to a certain optimization function. The value K is called the rank of the decomposition. One straightforward way to determine the rank K of a matrix is to look at its singular values: by taking K equal to the number of singular values of X that are distinctively higher than the remaining singular values, we get a good estimate for the rank of the non‐noise part of the data. Alternatively, several other, more complex methods exist for the rank estimation of a matrix, as the determination of the rank is a research discipline of its own.

Extracting the non‐negative components can be done using a variety of algorithms. Here, we will restrict ourselves to the examination of non‐negative matrix factorization (NMF), Kullback–Leibler NMF (KL‐NMF), probabilistic latent semantic analysis (PLSA), and latent Dirichlet allocation (LDA).
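To make the singular-value heuristic concrete, the following sketch inspects the knee of the singular-value spectrum on a synthetic stand-in for an MSI matrix (the matrix, its dimensions, and the crude automatic knee guess are illustrative assumptions, not part of this work's pipeline):

```python
import numpy as np

# Synthetic stand-in for an MSI matrix X (M m/z bins x N pixels): a rank-5
# non-negative signal with Poisson counting noise. Shapes are illustrative.
rng = np.random.default_rng(0)
M, N, true_k = 200, 300, 5
X = rng.poisson(rng.random((M, true_k)) @ rng.random((true_k, N)) * 20.0)

# Singular values in decreasing order; on a semi-log plot the noise part
# forms a flat tail, and the knee separates signal from noise.
s = np.linalg.svd(X, compute_uv=False)

# Crude automatic knee guess: the steepest drop between consecutive
# log-scaled singular values among the first candidates. In practice one
# would inspect the semi-log plot by eye, as done in the Results section.
drops = np.diff(np.log10(s[:20]))
k_est = int(np.argmin(drops)) + 1
print("candidate rank:", k_est)
```

In practice this guess should only serve as a starting point; as noted above, inspecting neighbouring values or using a criterion such as the AIC is advisable.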
Non‐negative matrix factorization (NMF)
The standard non‐negative matrix factorization decomposes the matrix into two factor matrices: W ∈ ℝ^{M×K}, of which the columns represent the rank‐K spectral basis elements, and H ∈ ℝ^{K×N}, describing the contribution of each basis element to the spectrum at each pixel. This can be written as:

X ≈ WH

and yields the following optimization problem:

min_{W ≥ 0, H ≥ 0} ‖X − WH‖²_F

where the standard Frobenius norm is used, denoted as ‖·‖_F.

The main advantage of NMF is the easy interpretability of the factor matrices. Using the additive linear combination of the spectral basis elements in W, weighted by the contribution matrix H, the spectral signature of each pixel can be approximated. This way, NMF provides meaningful information, both spectral and spatial, making it a well‐suited method for MSI data analysis.

Unfortunately, the NMF problem is NP‐hard (non‐deterministic polynomial‐time hard) and its solution is not unique. In most cases, the objective cannot be solved exactly and the optimum needs to be approximated numerically instead, which typically yields only a local optimum. For the numerical approximation, many algorithms exist, such as multiplicative update (MU) rules and alternating least squares.

The NMF decomposition, from the scikit‐learn Python package,
was configured with a coordinate descent (CD) solver. This is the most commonly used framework for NMF: it optimizes the factor matrices W and H in alternation, keeping one fixed while updating the other. This results in a non‐negative least‐squares subproblem with only one unknown matrix, which is convex and thus easier to solve. Because this optimization problem is symmetric in W and H, the same update can be used for both factor matrices. These updates guarantee a decrease of the objective function. The computational complexity of NMF on most real‐world problems equals O(mnk) operations per iteration.
As mentioned before, there exists a variation on traditional NMF that minimizes the KL divergence. In this case, a Poisson distribution is assumed when the factorization is performed, instead of the Gaussian noise implied by standard NMF. For each m/z value i and each pixel j, the observed intensity x_ij follows a Poisson distribution P(x_ij; λ_ij). The matrix is factored into similar matrices W and H, but the optimization problem now maximizes the following log‐likelihood function:

max_{W ≥ 0, H ≥ 0} ∑_{i,j} [ x_ij log (WH)_ij − (WH)_ij ]

with rate parameter λ_ij = (WH)_ij. Or, similarly, it minimizes the negative log‐likelihood:

min_{W ≥ 0, H ≥ 0} ∑_{i,j} [ (WH)_ij − x_ij log (WH)_ij ] + c

with c a constant depending only on X, not on W and H.

To solve KL‐NMF, multiplicative update rules (MU rules) are most commonly used. However, similar to standard NMF, the problem is NP‐hard and its solution is not unique. The interpretation of the factor matrices is similar, but using KL‐NMF now implies a Poisson distribution on the result. This corresponds better than standard NMF to the prior information known about MSI data but has, to our knowledge, not yet been examined in detail and only rarely been used for MSI data analysis. In practice, many NMF estimators exist for different stochastic distributions of the data.

The scikit‐learn Kullback–Leibler extension of NMF uses MU rules with the β‐divergence for β = 1 as factorization algorithm, as this is equivalent to the KL divergence. The MU rules are the most widely used approach for KL‐NMF, where the matrices W and H are updated with component‐wise multiplications, defined as follows:

H ← H ∘ [Wᵀ(X ⊘ (WH))] ⊘ [Wᵀ J]
W ← W ∘ [(X ⊘ (WH)) Hᵀ] ⊘ [J Hᵀ]

with ∘ and ⊘ denoting element‐wise multiplication and division, and J the M×N all‐ones matrix. These rules are easy to implement and scale well. This approach also has a computational complexity of O(mnk).
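The multiplicative updates can be written out directly in NumPy. The minimal implementation below is our own illustrative sketch (not the scikit-learn internals); the small `eps` terms, the synthetic Poisson matrix, and the `gen_kl` helper are assumptions added for numerical safety and for checking progress:

```python
import numpy as np

def kl_nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Minimal KL-NMF via the classic multiplicative updates (a sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    ones = np.ones_like(X)
    for _ in range(n_iter):
        # H <- H * (W^T (X / WH)) / (W^T J), element-wise
        H *= (W.T @ (X / (W @ H + eps))) / (W.T @ ones + eps)
        # W <- W * ((X / WH) H^T) / (J H^T), element-wise
        W *= ((X / (W @ H + eps)) @ H.T) / (ones @ H.T + eps)
    return W, H

def gen_kl(X, Y, eps=1e-10):
    """Generalized KL divergence D(X || Y) = sum X log(X/Y) - X + Y."""
    t = np.where(X > 0, X * np.log((X + eps) / (Y + eps)), 0.0)
    return float(np.sum(t - X + Y))

# Synthetic Poisson-count matrix with low-rank structure (illustrative only)
rng = np.random.default_rng(1)
rates = rng.random((60, 5)) @ rng.random((5, 50)) * 10.0 + 0.5
X = rng.poisson(rates).astype(float)

W, H = kl_nmf(X, k=5)
print("generalized KL after fitting:", gen_kl(X, W @ H))
```

Because the MU rules are monotone, the generalized KL divergence of the fit never increases over iterations, which is a convenient sanity check for any implementation.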
Probabilistic latent semantic analysis (PLSA)
The probabilistic equivalent of KL‐NMF is PLSA. In essence, both methods have the same theoretical objective, but they use different update schemes to reach the optimum. Also, PLSA, as a statistical approach, results in probability distributions, while the KL‐NMF factors consist of values relative to the given dataset. This makes them inherently dissimilar in practice and causes them to converge to different local optima.

PLSA originates from text mining, as it is an extension of latent semantic analysis (LSA). In text mining, PLSA observes the correspondence between words (w) in multiple documents and collects the word counts per document in a co‐occurrence matrix.
Additionally, in the setting of PLSA, we define as hidden variables the topics (c) across the dataset to which certain words and documents are related.
Translating this to MSI, as has been done in previous work, we can see this as the correspondence between the m/z bins (words) and the spectra (documents), located at each pixel entry.
The hidden topics, in the case of MSI, can be defined as the different cell types within the tissue. When the number of m/z bins ℓ in the dataset is sufficiently large, PLSA results in the following factorization model:

P(w, d) = ∑_{c=1}^{K} P(c) P(w | c) P(d | c)

Here, P(c), corresponding to the first factor of this equation, is a vector describing the probability of a random m/z value being related to the kth topic; the matrix P(w | c) describes the probability of a certain m/z value contributing to a certain topic; and the matrix P(d | c) describes the probability of each spectrum belonging to a certain topic or tissue type. As they entail probabilities, the vector and matrices are subject to sum‐to‐one constraints with respect to the spectra. If we define W = P(w | c) diag(P(c)) and H = P(d | c)ᵀ, the PLSA model can be seen as an NMF. More specifically, PLSA assumes X to be Poisson distributed in terms of WH. As the stationary points of KL‐NMF preserve the row and column sums of X, it can be proved that they correspond to a solution of PLSA.
This allows the correspondence of PLSA to KL‐NMF in the local optima, despite the normalization difference between the two formulations. Nevertheless, in practice both methods do differ: the MU rules of KL‐NMF do not take the normalization of PLSA into account explicitly, whereas PLSA uses the expectation–maximization (EM) algorithm, which applies a normalization in each step. For a more elaborate mathematical derivation of the correspondences and differences between these two methods, we refer the reader to a more elaborate work.

Using the EnsTop Python package, the expectation step calculates the probability of each topic given a word–document pair:

P(c | w, d) = P(c) P(w | c) P(d | c) / ∑_{c'} P(c') P(w | c') P(d | c')

This is followed by the maximization step, which uses the result of the expectation step to calculate the estimates P(w | c) and P(c | d):

P(w | c) ∝ ∑_d n(w, d) P(c | w, d)
P(c | d) ∝ ∑_w n(w, d) P(c | w, d)

with n(w, d) the observed count of m/z bin w in spectrum d. The computational complexity of PLSA equals O(mnk).
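The EM iteration fits in a few lines of NumPy. The sketch below is an illustrative re-implementation of the symmetric PLSA formulation (not the EnsTop code); it forms the dense posterior P(c | w, d) in the E-step, which is only feasible for toy sizes:

```python
import numpy as np

def plsa(X, k, n_iter=100, seed=0, eps=1e-12):
    """Minimal symmetric PLSA via EM. X[i, j] = count of m/z bin i in pixel j."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    pc = np.full(k, 1.0 / k)                      # P(c)
    pwc = rng.random((m, k)); pwc /= pwc.sum(0)   # P(w | c), columns sum to 1
    pdc = rng.random((n, k)); pdc /= pdc.sum(0)   # P(d | c), columns sum to 1
    for _ in range(n_iter):
        # E-step: posterior P(c | w, d) proportional to P(c) P(w|c) P(d|c)
        post = np.einsum('c,ic,jc->ijc', pc, pwc, pdc) + eps    # (m, n, k)
        post /= post.sum(axis=2, keepdims=True)
        # M-step: reweight by the observed counts n(w, d) and renormalize
        weighted = X[:, :, None] * post                         # n(w,d) P(c|w,d)
        pwc = weighted.sum(axis=1)
        pdc = weighted.sum(axis=0)
        pc = pwc.sum(axis=0); pc /= pc.sum()
        pwc /= pwc.sum(axis=0, keepdims=True) + eps
        pdc /= pdc.sum(axis=0, keepdims=True) + eps
    return pc, pwc, pdc

# Toy count matrix standing in for real MSI data (illustrative only)
rng = np.random.default_rng(2)
X = rng.poisson(3.0, size=(40, 30)).astype(float)
pc, pwc, pdc = plsa(X, k=5, n_iter=50)
```

Note that, unlike the KL-NMF factors, all three outputs are normalized probability distributions, which is exactly the practical difference discussed above.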
Latent Dirichlet allocation (LDA)
Although LDA is not generally thought of as a factorization method, it is a generalization of PLSA and is often considered the better of the two in the field of text mining, where both originate. The main difference between the two methods is that PLSA aims to model the dataset exactly, while LDA tries to find unknown groupings within the data. LDA is a generative statistical model that uses a Dirichlet prior to impose sparsity on the solution, making it more robust against modelling the noise in the dataset. For MALDI‐TOF MSI, it has been used as a soft clustering technique and, with tandem mass spectrometry (MS/MS), one can use LDA to cluster the fragmentation data into substructures, as has been done in previous research.
Let us define, similarly as for PLSA: N as the number of m/z bins (or words) in the data, K as the number of components (or topics), which can be seen as the different cell types within the tissue, and D as the number of spectra or pixels in the MSI dataset (which corresponds to the number of documents in the setting of text mining). Also, let θ_d be the probability of a spectrum (d) belonging to a cell type (c) according to a Dirichlet prior α, and β_c be the probability of a certain m/z bin (w) belonging to a cell type (c) according to another Dirichlet prior η.

A pseudo‐algorithm of the generative LDA method can now be defined as:

1. For each component or cell type c ∈ {1, …, K}, draw β_c ~ Dirichlet(η).
2. For each spectrum d ∈ {1, …, D}, draw θ_d ~ Dirichlet(α).
3. For each m/z bin i in spectrum d:
   (a) draw the cell‐type assignment z_{d,i} ~ Multinomial(θ_d),
   (b) draw the observed m/z bin w_{d,i} ~ Multinomial(β_{z_{d,i}}).

This process generates a latent structure that can be used either for the prediction of new entries or for data exploration. Using the scikit‐learn package for LDA, one can use an efficient batch variational Bayes algorithm to estimate the posterior distribution p(z, θ, β | w, α, η). Using this algorithm, LDA has a computational complexity of O(mnk). For a more detailed description of this algorithm, we refer the reader to the work of Hoffman et al. Note that, if we sum over the different topics, LDA allows a "multinomial PCA" interpretation, i.e., a probabilistic factorization optimizing the squared error.
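In code, the batch variational inference described above is readily available in scikit-learn. A small sketch, with a synthetic count matrix standing in for real pixel spectra (note that scikit-learn expects pixels as rows, i.e., the transpose of X as defined earlier):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical pixels-as-documents count matrix: 150 pixels x 100 m/z bins
rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(150, 100))

# Batch variational Bayes with the iteration cap used in this paper
lda = LatentDirichletAllocation(n_components=5, learning_method="batch",
                                max_iter=20, random_state=0)
theta = lda.fit_transform(counts)   # (150, 5): per-pixel topic distribution
beta = lda.components_              # (5, 100): unnormalized topic-word weights
beta = beta / beta.sum(axis=1, keepdims=True)  # normalize to P(m/z bin | cell type)
```

The rows of `theta` sum to one, so each pixel receives a soft cell-type assignment, which is what makes LDA usable as a soft clustering technique for MSI.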
Evaluation criteria
To compare the methods described above, multiple evaluation criteria are investigated. First, we compare all methods numerically: we calculate the reconstruction error, the computation time, and the memory usage for each experiment.

The reconstruction error is investigated by means of three different measures: the relative ℓ1‐norm, the relative ℓ2‐norm, and the KL divergence. We define the relative errors as the norm of the difference between the original and reconstructed data divided by the norm of the original data:

e_p = ‖vec(X) − vec(WH)‖_p / ‖vec(X)‖_p

with the ℓp‐norm for a vector x and a real value p ≥ 1 formulated as:

‖x‖_p = (∑_i |x_i|^p)^{1/p}

The relative KL divergence between X and WH is calculated as:

D_KL(P ‖ Q) = ∑_i [ P_i log(P_i / Q_i) − P_i + Q_i ]

Here, P and Q correspond respectively to the vectorized original data X and the vectorized approximation WH.

As has been shown in the previous section, NMF and LDA impose a Gaussian distribution of the noise, and KL‐NMF and PLSA a Poisson distribution. Because of these statistical priors, we can expect NMF to perform better on the squared error (ℓ2‐norm, Frobenius norm) and KL‐NMF and PLSA better on the KL divergence error. To this end, we included a third, unrelated error measure: the ℓ1‐norm. This norm calculates the error by means of absolute values rather than the squared values used by the ℓ2‐norm, which makes the ℓ1 reconstruction error more robust towards outliers. Note that for LDA the comparison using reconstruction errors makes less sense, as the model is generative and does not aim to reconstruct the data. We will thus only calculate the KL divergence in the case of LDA, to estimate to what extent the model follows a distribution similar to that of the given dataset.

Additionally, we also measure the average computation time and memory usage for each experiment. For each method and for each dataset, we perform the experiment ten times, running until convergence or until the maximum number of iterations is reached. Because of the size of real MSI datasets, which is expected to increase still further as the technology advances, we monitor time and memory usage as they might reveal practical limitations of the algorithms.

Besides numerical performance measures, there are a number of additional criteria that should be investigated as well. First, we check how well the spatial distributions resemble the solutions obtained by the non‐linear mapping of UMAP. We know such methods are better at capturing the non‐linearity of MSI data, so they can serve as an additional quality check. Secondly, since MSI data is ever increasing in size, sparse representations could offer a solution in the future: large MSI datasets can then be represented more efficiently, using less memory and less computational power. We want to know how well the examined methods perform with sparse representations, and thus whether they are able to scale up together with the data size. Thirdly, as mentioned before, we want to understand the meaning of the results. Depending on the mathematical framework and the model used, the interpretation of the components will differ. We will compare each method in terms of its implications for the interpretation. Additionally, the computational complexity and mathematical simplicity of each algorithm will be reviewed.
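The three error measures are straightforward to compute. The helpers below are an illustrative sketch; in particular, the use of the generalized KL divergence (consistent with the KL-NMF objective) is our assumption, and the exact normalization used in the paper may differ:

```python
import numpy as np

def relative_lp_error(X, X_hat, p):
    """Relative lp-norm error between the vectorized original and reconstruction."""
    x, y = X.ravel(), X_hat.ravel()
    return float(np.linalg.norm(x - y, ord=p) / np.linalg.norm(x, ord=p))

def kl_divergence(X, X_hat, eps=1e-12):
    """Generalized KL divergence D(P || Q) between the vectorized matrices."""
    p, q = X.ravel() + eps, X_hat.ravel() + eps
    return float(np.sum(p * np.log(p / q) - p + q))

# Hypothetical original data and a slightly perturbed reconstruction
rng = np.random.default_rng(4)
X = rng.poisson(5.0, size=(50, 40)).astype(float)
X_hat = np.clip(X + rng.normal(0.0, 0.5, X.shape), 0.0, None)

print("rel l1:", relative_lp_error(X, X_hat, 1))
print("rel l2:", relative_lp_error(X, X_hat, 2))
print("KL    :", kl_divergence(X, X_hat))
```

All three measures are zero for a perfect reconstruction and non-negative otherwise, which makes them directly comparable across the four methods.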
Finally, aside from the mathematical aspects, we will take a closer look at the biological implications for a human‐lymph‐node sample. Using all these numerical and non‐numerical evaluation criteria, we aim to offer a complete comparison of NMF, KL‐NMF, LDA, and PLSA for the analysis of MSI data.
Experimental setup
Data
In order to draw proper conclusions on the performance of each method, we executed the experiments on multiple real datasets: mouse pancreas data (M1, M2, and M3) and human‐lymph‐node data (P1, P2, P3, and P4). The datasets were acquired using MALDI‐TOF, focusing on lipid masses of 620–1200 Da for the lymph‐node data and protein masses of 2–20 kDa for the pancreatic data. Total‐ion‐count (TIC) normalization was applied to each dataset. Table 1 gives an overview of the specifications of each dataset. For details on the data acquisition, we refer the reader to the work of Smets et al.
TABLE 1
Specifications of the mouse pancreas data (M1, M2, and M3) and human‐lymph‐node data (P1, P2, P3, and P4). The sparsity of the datasets is defined as the fraction of zero entries in the matrices.

Sample | Pixels  | m/z bins | Sparsity | Size (GB)
M1     | 10 606  | 14 000   | 0.030    | 1.19
M2     | 14 791  | 14 000   | 0.090    | 1.66
M3     | 6937    | 14 000   | 0.095    | 0.78
P1     | 572 173 | 8000     | 0.005    | 36.62
P2     | 552 701 | 8000     | 0.015    | 33.33
P3     | 354 444 | 12 000   | 0.041    | 34.03
P4     | 275 133 | 12 000   | 0.020    | 26.41
Implementation
We examined the four methods (NMF, KL‐NMF, PLSA, and LDA) on the seven tissue samples described above. The experiments were done in Python using the scikit‐learn package
for NMF, KL‐NMF, and LDA, and the EnsTop
package for PLSA.

For a fair comparison, all experiments were executed using high‐performance computing (HPC) infrastructure provided by the Flemish Supercomputer Centre (VSC). Each setup used 36 computing cores on two Xeon Gold 6140 CPUs @ 2.3 GHz.

All four algorithms require a given number of components, or rank, for the decomposition. The meaning of this rank is the same for each of the methods: it is the number of components or cell types we expect to contribute to the data. We determined the rank of the decomposition by means of the singular value decomposition, as explained in section 3.1.

Each experiment was run until the update between two iterations fell below a threshold of 1e‐3, or until a maximum number of iterations was reached, whichever came first. We set the maximum number of iterations to 400 for NMF, KL‐NMF, and PLSA, and to 20 for LDA. The number of iterations for LDA is much lower because LDA has a much longer computation time per iteration.

Finally, we tried to improve the optimization by using a sparsity‐favouring initialization: non‐negative double singular value decomposition (NNDSVD) instead of random initialization. This was done for NMF and PLSA. For KL‐NMF, solved with multiplicative update rules, we used the NNDSVD variant that fills the zeros with small random values, such that they can still be updated during the iterations. LDA, being a statistical generative model, uses a prior for the topic–word distribution (β) and a prior for the document–topic distribution (θ). Both were initialized at the library's default value.
RESULTS
Each algorithm ran multiple times on each of these samples, while measuring the numerical evaluation criteria. The additional non‐numerical criteria are investigated separately.
Rank selection
MSI data contains a significant amount of noise, which we do not want to model exactly. We want to select a number of components for the factorization methods equal to the rank of the non‐noise part of the data. Using a singular value decomposition, one can plot the singular values in a semi‐log manner and select a rank equal to the knee in this graph.
This way, we estimate the number of components based on the variance explained by the singular values. As soon as the magnitude of the singular values stops decreasing significantly, one can expect that the remaining singular values describe the noise present in the data. For a detailed investigation of an MSI dataset, one can use the SVD as a first guess, but it is advised to inspect neighbouring values as well, or to use an additional method such as the AIC criterion for rank selection. We refer to Ubaru and Saad for further information on rank estimation.

Figures 2 and 3 display, for the M2 pancreas data and the P1 human‐lymph‐node data respectively, the formation of the knee around the fifth‐largest singular value. Based on this, we decompose the data into five components for both the pancreatic and the lymph‐node samples. This choice of rank, K = 5, is also supported by the UMAP visualization, in which one can find five distinct regions for M2 as well as for P1; see Figure 6.
FIGURE 2
Rank selection of the pancreas sample M2. The 20 largest (right‐to‐left) singular values are plotted on a semi‐log scale. There is a kink in the curve around the fifth‐highest singular value. A good estimate for the rank of this matrix is thus K = 5
FIGURE 3
Rank selection of the lymph‐node sample P1. The 50 largest (right‐to‐left) singular values are plotted on a semi‐log scale. There is a kink in the curve around the fifth‐highest singular value. A good estimate for the rank of the matrix is thus K = 5
FIGURE 6
UMAP visualization with cosine distance metric of samples M2 (A) and P1 (B). Obtained from the work of Smets et al
reproduced with permission
Pancreas data
The first dataset on which we compare the factorization methods is the mouse pancreatic tissue. We have three samples, M1, M2, and M3, of relatively small size for MSI data; see Table 1. Each sample was decomposed into five components using NMF, KL‐NMF, PLSA, and LDA. Table 2 shows the average relative ℓ1‐norm, relative ℓ2‐norm, KL divergence, computation time, and memory usage after five runs each. In the case of the pancreatic samples, each method reaches convergence before the maximum number of iterations. Note that each method converges to exactly the same value over the multiple runs, making the standard deviation of the reconstruction errors zero; only the computation time and memory usage fluctuate over multiple runs. We see, as expected, NMF minimizing the relative ℓ2‐norm and KL‐NMF the KL divergence error. Additionally, KL‐NMF also scores best on the relative ℓ1‐norm. It is interesting to see that PLSA scores second best on the KL divergence error, as we would expect. LDA only scores reasonably well on the KL divergence for the M3 sample.
TABLE 2
Numerical analysis of NMF, KL-NMF, PLSA, and LDA on three mouse pancreatic samples: M1, M2, and M3. For the reconstruction error, three measures are used: the relative ℓ1-norm, the relative ℓ2-norm, and the Kullback–Leibler divergence. Each experiment ran until convergence in the update was reached, with a threshold of 1e-3
Sample  Method   ℓ1       ℓ2        KL      Time (s)        Mem. usage (GB)
M1      NMF      0.4516   2.486e-6  0.2108      2.6 ± 0.0   1.1 ± 1.0
M1      KL-NMF   0.3955   3.233e-6  0.1480     60.9 ± 3.1   6.3 ± 1.4
M1      PLSA     0.8700   0.1350    0.1658     44.4 ± 0.9   6.6 ± 2.3
M1      LDA      -        -         1.1770  3 970.4 ± 27.7  1.2 ± 0.0
M2      NMF      0.4770   1.838e-6  0.2578      3.7 ± 0.1   0.8 ± 1.6
M2      KL-NMF   0.4366   2.290e-6  0.1922     88.2 ± 7.2   7.4 ± 1.7
M2      PLSA     0.8283   0.0909    0.2096     63.9 ± 5.1   6.5 ± 2.4
M2      LDA      -        -         1.5376  4 221.0 ± 29.1  1.7 ± 0.002
M3      NMF      0.5719   1.266e-5  0.2872      1.6 ± 0.0   0.2 ± 0.4
M3      KL-NMF   0.5585   1.442e-5  0.2782     35.8 ± 4.7   3.6 ± 1.0
M3      PLSA     0.7465   0.3067    0.2919     27.1 ± 0.3   2.1 ± 1.6
M3      LDA      -        -         0.5658  3 761.2 ± 51.6  0.8 ± 0.002
The computation times differ significantly between the investigated methods. NMF clearly executes the fastest, and LDA is by far the slowest, even taking into account that we limited its number of iterations to 20. PLSA is somewhat faster than KL-NMF, but of the same order of magnitude. For the memory usage, NMF is clearly the most efficient; this can be explained by the fact that NMF reaches the update threshold within a single iteration for all samples. LDA, KL-NMF, and PLSA are again of the same order of magnitude, with LDA the most memory efficient of the three.

The resulting decompositions can be visualized as a spectral and a spatial distribution for each component. Figure 4 shows the obtained results for sample M2. A first glance at these results shows that the PLSA decomposition misses a lot of information in its spatial distributions and that the LDA results show poor spectral distributions.
FIGURE 4
Visual results of the mouse pancreas sample M2. For each method: (A) NMF, (B) KL‐NMF, (C) PLSA, and (D) LDA we have executed a decomposition into five components. The spatial distributions are given in the top rows and the spectral distributions in the bottom rows
Note that the spectra have different scales for the intensities across all methods and components. With the pancreatic data, most differences between the decompositions are noticeable in the smaller peaks.
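The decompositions above can be reproduced in spirit with scikit-learn, which implements both objectives compared here: the squared Frobenius error (standard NMF, Gaussian noise assumption) and the Kullback-Leibler divergence (KL-NMF, Poisson noise assumption). A hedged sketch on synthetic data, not the authors' implementation:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(2)
X = rng.random((100, 60))  # synthetic stand-in for a pixels x m/z-bins matrix

# Standard NMF: minimizes the squared Frobenius error.
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # per-pixel weights -> spatial distributions
H = nmf.components_        # spectral basis elements

# KL-NMF: the KL objective requires the multiplicative-update solver.
klnmf = NMF(n_components=5, init="nndsvda", beta_loss="kullback-leibler",
            solver="mu", max_iter=500, random_state=0)
Wk = klnmf.fit_transform(X)
Hk = klnmf.components_
```

Each row of H (or Hk) is one spectral component; reshaping the matching column of W back to the image grid gives the spatial distribution shown in the figures.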
Lymph‐node data
The second dataset we will investigate is the human-lymph-node samples. We perform the same experiments on the four lymph-node samples, P1, P2, P3, and P4, which are of roughly comparable data size, see Table 1. Their sizes are, however, multitudes greater than those of the pancreatic samples, which significantly affects the peak memory usage. In fact, the lymph-node data much more closely resembles the current standard of MSI data with higher spatial resolutions. Again, each sample was decomposed into five components using NMF, KL-NMF, PLSA, and LDA. From Table 3 we can deduce similar trends as for the pancreatic samples. As was the case there, NMF outperforms the others in minimizing the relative ℓ2-norm, and the same is true of KL-NMF for the relative ℓ1-norm and the KL divergence error. Looking at the standard deviation for each method, we note that they all keep converging to the same solution. Only for LDA was convergence of the update between two iterations not reached; LDA was stopped after its maximum number of iterations (12), as its computation time, roughly 12 h, was already many times higher than those of the other methods.
TABLE 3
Numerical analysis of NMF, KL-NMF, PLSA, and LDA on four human-lymph-node samples: P1, P2, P3, and P4. For the reconstruction error, three measures are used: the relative ℓ1-norm, the relative ℓ2-norm, and the Kullback–Leibler divergence. Each experiment for NMF, KL-NMF, and PLSA ran until convergence in the update was reached, with a threshold of 1e-3. For LDA, convergence was not reached within the maximum number of 12 iterations
Sample  Method   ℓ1       ℓ2        KL          Time (s)     Mem. usage (GB)
P1      NMF      0.8042   2.740e-6  0.5559     120 ± 3       104 ± 2
P1      KL-NMF   0.7865   2.869e-6  0.5365   2 919 ± 133     218 ± 1
P1      PLSA     4.9300   0.1475    1.1545   2 159 ± 46      422 ± 8
P1      LDA      -        -         1.0710  46 297 ± 811      37 ± 1
P2      NMF      0.6213   1.080e-7  0.4185     112 ± 8        92 ± 3
P2      KL-NMF   0.6065   1.147e-7  0.3308   2 687 ± 105     196 ± 2
P2      PLSA     3.0266   0.0778    0.4077   1 754 ± 69      307 ± 15
P2      LDA      -        -         1.2304  42 981 ± 2 415    33 ± 1
P3      NMF      0.6210   3.233e-7  0.3342     119 ± 12       93 ± 3
P3      KL-NMF   0.5958   3.406e-7  0.3133   3 761 ± 52      197 ± 1
P3      PLSA     4.2952   0.1926    0.3418   1 856 ± 151     303 ± 15
P3      LDA      -        -         0.9790  42 455 ± 264      35 ± 2
P4      NMF      0.6823   2.557e-7  0.3899     102 ± 11       73 ± 2
P4      KL-NMF   0.6402   2.706e-7  0.3531   2 096 ± 60      155 ± 1
P4      PLSA     3.4761   0.1882    0.3899   1 379 ± 120     242 ± 14
P4      LDA      -        -         0.9769  33 897 ± 261      28 ± 2
For the computation times, NMF is again by far the fastest. PLSA and KL-NMF are again of the same order of magnitude and still reasonably fast, with KL-NMF being the slower of the two but reaching better reconstruction errors. Memory usage is now the lowest for LDA.

Visually, Figure 5 shows the spectral and spatial distributions per method and per component obtained for P1. A gamma correction of γ = 0.5 was applied to the 2D visualizations for higher contrast. NMF, KL-NMF, and PLSA all show clear decompositions, but differ in their details. Each method extracts different contributions of the relevant peaks. Again, as for the pancreatic samples, LDA results in poor spectral representations.
FIGURE 5
Visual results of the human‐lymph‐node sample P1. For each method: (A) NMF, (B) KL‐NMF, (C) PLSA, and (D) LDA we have executed a decomposition into five components. The spatial distributions are given in the top rows and the spectral distributions in the bottom rows
Additional evaluation criteria
Aside from the numerical experiments measuring reconstruction errors, computation time, and memory usage, we also want to assess the performance of each method more generally. Since no single criterion is definitive for the evaluation of MSI data analysis, we include several additional evaluation criteria in order to obtain a complete review of every method.
Visual inspection using UMAP
As already described in the works of Smets et al and some recent computational review papers,22,24 UMAP-based data processing is very well suited for the analysis of MSI datasets. Figure 6A, obtained from Smets et al, shows the visualization of a three-dimensional UMAP embedding with cosine distance of sample M2. From the UMAP embedding, we can distinguish five different regions based on the five distinct colors: pink, blue, black, red, and green. Based on previous research, we know that the pink region corresponds to the islets of Langerhans, the blue region is linked to exocrine tissue, and the black region to blood vessels. The red and green regions are not easy to interpret and have not yet been related to an anatomical structure. To this end, we will perform a visual comparison based solely on the pink, blue, and black regions of the UMAP embedding.

For all methods, the first two components give a partition between the islets and the rest of the tissue. This can be explained by the high intensity of insulin (m/z = 5800) in the dataset. For the exocrine tissue, we see a similar component in NMF and KL-NMF, which is not detected by LDA. PLSA also uncovers the exocrine tissue, but as a less clear image. Lastly, the blood vessels are not found by NMF, but are by KL-NMF, PLSA, and LDA. Note that, for this anatomical component, we see more noise in the spectra than for the others. LDA shows visually rather good distributions but lacks precision in the spectra. For our MALDI dataset, LDA is too general in the decomposition of the spectra.

Similarly, we take a look at the correspondence of the decompositions to the UMAP embedding of P1, see Figure 6B, also obtained from Smets et al. The five colored regions we can distinguish here are: yellow, red, blue, green, and orange. The spatial distributions of NMF for P1 show green, blue/green, orange/yellow, and two unidentifiable regions compared with the UMAP decomposition. KL-NMF shows the green region in its first component, next blue/red, yellow, an unidentifiable region, and lastly the orange part. For the P1 sample, PLSA performs better than for the M2 sample; we now obtain a visualization that corresponds to UMAP about as well as that of KL-NMF. Both methods, however, show clearly different decompositions of the sample, and neither is able to extract the exact five regions visible in the UMAP representation. The spatial components of LDA show less detailed visualizations; they correspond to UMAP about as well as those of NMF.
Scalability
Anticipating the increase in resolution of future MSI datasets, we want to investigate how well the factorization methods will cope with even bigger data in the future. The use of sparse representations will be essential but, so far, not all algorithms are able to provide equally good results using sparse operations. For the algorithms used in the numerical experiments, we tested their sparse equivalents to gain some insight into their behavior. NMF improves in speed with little loss in performance. Sparse KL-NMF also improves in computation time, but at a significant loss in accuracy; the sparse equivalent of MU is not yet satisfactory for MSI data analysis. The EnsTop PLSA algorithm still shows some bugs when given a sparse input. LDA, however, is faster and more accurate with a sparse input because of the sparse Dirichlet prior. Note that research into factorization methods is still ongoing and might lead to new extensions for big-data analysis in the future.

Additionally, as Table 1 shows, the datasets do not exhibit strong sparsity properties: few entries are exactly zero, but many values are small. By applying a cut-off, one might increase sparseness. How to establish a cut-off value and exploit sparse MSI data analysis remains to be explored.
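The cut-off idea raised above can be sketched with scipy's sparse matrices. The threshold value here is purely hypothetical; as noted, choosing it (and verifying that it does not discard meaningful low-intensity peaks) is exactly the open question.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(3)
X = rng.random((1000, 500))   # dense MSI-like matrix: few exact zeros,
X[X < 0.02] = 0.0             # but many small values

# Hypothetical cut-off: zero out intensities below a threshold, then
# store the result in compressed sparse row format.
threshold = 0.5
X_cut = np.where(X >= threshold, X, 0.0)
X_sparse = sparse.csr_matrix(X_cut)

# Fraction of entries that survive the cut-off.
density = X_sparse.nnz / (X.shape[0] * X.shape[1])
```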
Interpretability
The interpretation of the decomposition depends on the method that was used. For the algebraic methods, the obtained factors can be seen as the spectral basis elements of the data matrix and a weight matrix describing their contribution to each pixel. The components retrieved using NMF and KL-NMF contain information that is deduced directly from the data matrix.

For PLSA, we get similar components, except that the weight matrix is now a probability matrix. The results show how likely a certain m/z value is to be present at a certain pixel. Although the spectra may look similar, they have to be approached differently regarding interpretation.

LDA adds a Dirichlet prior to PLSA, imposing more sparsity in the result and better properties for the generation of new "documents". This generative property has little use in the field of MALDI-MSI, which is why the resulting spectral distributions are not as good when decomposing with LDA; it generalizes what spectral information might be involved in each component. It could be more useful when an MSI dataset contains fragmentation effects we do not want to model, as is the case for MS/MS.
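The probabilistic reading of PLSA can be mimicked by normalizing NMF-style factors into probability distributions. A sketch of this interpretation (the normalization convention is an assumption, not the paper's exact one):

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.random((30, 5))   # pixel-by-component weights (spatial factor)
H = rng.random((5, 20))   # component-by-m/z spectral factors

# Normalize each spectral component into p(m/z | component) ...
p_mz_given_k = H / H.sum(axis=1, keepdims=True)
# ... and each pixel's weights into p(component | pixel).
p_k_given_pixel = W / W.sum(axis=1, keepdims=True)

# p(m/z | pixel): how likely each m/z value is present at each pixel,
# which is the quantity PLSA exposes directly.
p_mz_given_pixel = p_k_given_pixel @ p_mz_given_k
```

Unlike the raw NMF factors, every row here sums to one, so component entries read as probabilities rather than absolute intensity contributions.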
DISCUSSION
For the selection of an MSI data analysis approach there will always be a trade-off; there is no holy grail (yet!). This work focuses on NMF-related factorization methods and, as there are many, it is not always clear how they are alike and how they differ. We examined NMF, KL-NMF, PLSA, and LDA both experimentally and theoretically. We can conclude that each method has its own strengths and, depending on the goal of the analysis, one might outperform the other. Interestingly, KL-NMF had not yet been thoroughly evaluated in the MSI context but proved to be an, at least, equally useful approach.

NMF is a well-known method for data analysis in many fields, including MSI. Its popularity comes from its readily interpretable decomposition. From the numerical assessment, we find that NMF outperforms all other methods by far in terms of computation time and often has the smallest memory usage as well. Naturally, NMF also scores best on the relative ℓ2-norm, as its objective function is the minimization of the squared error. Additionally, the results are easy to interpret as spectral basis elements and their spatial weighting across the samples. For MSI data factorization, we advise always starting with NMF. Even for future, possibly bigger MSI datasets, NMF is easy to extend to its sparse equivalent with only small losses in information.

A mathematically better-suited extension of NMF for the factorization of MSI data is KL-NMF. By assuming Poisson-distributed noise, rather than Gaussian, we see that the visual results are even better. Comparison with the UMAP embedding of Figure 6 shows that the KL-NMF components (Figures 4 and 5) extract all regions visible for M2 and almost all regions for P1. This can be seen as a huge advantage, offering a spectral distribution for the spatial regions UMAP identifies. It is in fact the only method that consistently captures the spatial distributions so well, making it even more surprising that it has rarely been used for MSI so far. KL-NMF was also able to outperform the other methods in terms of both the relative ℓ1 reconstruction error and the KL divergence error. Nevertheless, KL-NMF has some disadvantages. First, computationally it takes longer than NMF, with a difference of one order of magnitude. Also, the sparse implementation of the algorithm, which could be essential in the future, performs much worse.

At the statistical end, we took a closer look at PLSA. It has been known for decades that it theoretically resembles KL-NMF.
More recently, it has been shown in many fields that there are in fact differences between both methods.
PLSA and KL-NMF optimize the same objective but are identical methods only if KL-NMF adds an additional column normalization and if both use the same algorithm. MU and EM are different algorithms and will result in different solutions when tested on the same data with the same initialization.
This statement is supported by the reconstruction errors: on every run, each method appears to converge to its own optimum. The same applies to the visual results, where we can clearly see that they converge to different local optima. When running until the update between two iterations reaches a threshold of 1e-3, KL-NMF shows a better result than PLSA, which still misses information. PLSA is a popular method for MSI, where good results are obtained, while in text mining it is becoming less and less popular. Looking at the mathematics, PLSA will work best when many samples are available.
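For concreteness, the multiplicative-update (MU) rules used by KL-NMF are the classic Lee–Seung updates for the generalized KL divergence. The sketch below is a minimal numpy implementation on synthetic data, not the code used in the experiments; PLSA's EM iteration differs from these updates, which is why the two methods land in different local optima even from the same initialization.

```python
import numpy as np

def kl_nmf(X, K, n_iter=200, seed=0):
    # Lee-Seung multiplicative updates minimizing the generalized
    # KL divergence D(X || WH). A sketch only; production code needs
    # convergence checks and further numerical safeguards.
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, K)) + 0.1
    H = rng.random((K, n)) + 0.1
    eps = 1e-12
    for _ in range(n_iter):
        W *= ((X / (W @ H + eps)) @ H.T) / (H.sum(axis=1) + eps)
        H *= (W.T @ (X / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
    return W, H

def gkl(X, Y, eps=1e-12):
    # Generalized KL divergence, the objective the updates decrease.
    return float(np.sum(X * np.log((X + eps) / (Y + eps)) - X + Y))

rng = np.random.default_rng(5)
X = rng.random((40, 5)) @ rng.random((5, 30))  # exactly rank-5, nonnegative
W, H = kl_nmf(X, K=5)
fitted = gkl(X, W @ H)
baseline = gkl(X, np.full_like(X, X.mean()))   # constant-matrix baseline
```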
This is especially the case for MSI data, where each spectrum can be seen as a sample. Additionally, PLSA is often critiqued in text mining for being too sensitive to noise in comparison with LDA; it approximates the given dataset following the objective and does not try to find a generative model for the data. In the case of MSI, however, we specifically want every m/z bin to be equally important. Unfortunately, due to the normalization and the use of the EM algorithm, it does not perform consistently well visually. Although it is slightly faster in runtime, this statistical, normalized counterpart of KL-NMF shows less accurate spatial distributions across all samples tested, with the exception of P1, where the quality of its spatial distribution matched that of KL-NMF. The spectral basis elements resemble those of KL-NMF, albeit on a different scale due to its statistical nature. PLSA does not show absolute contributions of m/z values in its spectral components, but rather the probabilities of each bin contributing.

Finally, LDA is often seen as a generalization of PLSA, as it is able to capture more general trends, avoiding the overfitting of noise. For MALDI-MSI, however, generalization is not preferred: crucial information might still be present in the details of the data. Defining what is noise and what is meaningful is very hard for MSI data. Here a difference arises with the field of text mining, where it might be easier to determine what is essential information and what are filler words. MSI is an exploratory technique for which little prior information is available, and it is harder to check whether something is essential or not. In our experimental setup, LDA performed worst. Computationally, it took much longer to run until convergence; for the lymph-node data, it took approximately 12 h to run the 12 iterations, which is the reason the iterations were limited to 12.
Even with a sparse input, the computation time stays within the same order of magnitude. Looking at the UMAP resemblance, the spatial components of LDA often performed better than those of PLSA, as they are smoother and have better contrast, but not as well as KL-NMF. Spectrally, the distributions display little information on the chemical properties, making it harder to interpret the spatial results. LDA has been used on MSI for the clustering of data, improving memory usage with respect to hierarchical clustering, and for MS/MS to cope with fragmentation effects. From Tables 2 and 3, we see that its memory usage is lower than that of PLSA, but NMF still performs better.

Aside from the mathematical comparison, we also want to take a closer look at the biochemical interpretation of the resulting spectra. For the pancreatic tissue M2, we refer to the works of Smets et al. Here, we focus on the human-lymph-node sample. It has become obvious that MALDI-based MSI with a focus on the lipid mass range (600–1200 m/z, "lipid imaging") is also very well suited to acquiring profiles directly from the tissue. These profiles can be applied to facilitate diagnosis, prognosis, and follow-up of diseases and therapy. An example of this lipid profiling of selected anatomical regions is shown in Figures 7 and 8, where the differences in MS profiles around the lymph nodules inside the tissue are explored. As shown, the majority of the molecular information (the number and the height of mass peaks) in positive ion mode is concentrated in a relatively small mass range (700–850 m/z). This range indeed corresponds to masses of (membrane) lipids, especially phosphocholine derivatives (PCs) in positive ion mode. The profiles extracted from various regions demonstrate that a substantial number of lipid mass peaks are common between the various anatomical structures, and that their relative abundance is of crucial importance. On the other hand, some masses seem to have a higher discriminative value. The masses of 630.7 and 632.7 m/z are found particularly in the mantle zone (see Figure 7). Comparison of normal versus tumoral lymph-node tissue in the P1/P3/P4 samples (P2 did not show tumoral tissue) suggests a slightly stronger appearance of m/z 741.6, 761.6, 820.6, and 849.6 in the tumoral transformed regions (see Figure 8 for P1). These observations all underscore the importance of data-processing methods such as the UMAP-based methods described earlier or the KL-NMF-based method described in this paper, which yield excellent decomposition and high-quality visualization with high overall sensitivity.
FIGURE 7
Detailed analysis of the human lymph sample P1. From the entire sample, with its mean spectrum given, we deduce three regions present in the orange box around the nodules: 1. the mantle zone, 2. the core zone, and 3. the outer part. In red, we show the m/z values belonging to the top five intensities present in each region
FIGURE 8
Detailed analysis of the human lymph sample P1. From the entire sample, with its mean spectrum given, we deduce three regions present in the orange box around the nodules: 1. the mantle zone, 2. the core zone, and 3. the outer part. In red, we show respectively the m/z values and intensities of the four m/z values of interest to us: 741.6, 761.6, 820.6, and 849.6
It should be noted that a positive identification of the mass peaks is less straightforward, as multiple lipids can share the same or nearly the same molecular weight. For instance, lipids with the same number of double bonds, but with a difference in the position of these double bonds, all have identical masses. A suggestion of the possible identity of the mass peaks can be retrieved by consulting the Lipid Maps database (www.lipidmaps.org), but for final identification complementary analytical strategies are required, such as liquid chromatography/electrospray ionization mass spectrometry (LC/ESI-MS) with internal lipid standards to support the identity and provide (semi-)quantitative data on the lipid content.
CONCLUSIONS
Within the field of MSI, factorization methods are still a popular tool to interpret the chemical properties present in a tissue sample. We focused on four NMF-related methods (NMF, KL-NMF, PLSA, and LDA) and evaluated their performance on multiple criteria and on multiple MALDI-MSI datasets. Each method makes its own significant contribution to the analysis of MSI data, but they are rarely compared with one another. We have shown that NMF is a fast, easy, and accurate factorization method. However, it assumes Gaussian noise, whereas MSI data follow a Poisson distribution. KL-NMF and PLSA both take this characteristic into account but, unlike often assumed, do not result in the same decomposition. LDA can be seen as a generalization of PLSA, and is often considered the better of the two, but does not offer an advantage in the context of MALDI-MSI data factorization. In our experiments, KL-NMF was able to represent the data most similarly to the UMAP visualization while, surprisingly, it has rarely been used for MSI data analysis so far. Where UMAP exhibits state-of-the-art MSI visualizations, factorization methods such as KL-NMF improve data interpretation: they simultaneously decompose MSI data into both spatial and spectral distributions, offering more possibilities for further analysis of the biochemical structures present in tissues. In this work, we restricted ourselves to these four methods, but extending this investigation to the implications and interpretations of other methods will offer better prospects for the interpretation of MSI data.