Abstract
Non-negative matrix factorization (NMF) condenses high-dimensional data into lower-dimensional models subject to the requirement that data can only be added, never subtracted. However, the NMF problem does not have a unique solution, creating a need for additional constraints (regularization constraints) to promote informative solutions. Regularized NMF problems are more complicated than conventional NMF problems, creating a need for computational methods that incorporate the extra constraints in a reliable way. We developed novel methods for regularized NMF based on block-coordinate descent with proximal point modification and a fast optimization procedure over the alpha simplex. Our framework has important advantages in that it (a) accommodates a wide range of regularization terms, including sparsity-inducing terms like the L1 penalty, (b) guarantees that the solutions satisfy necessary conditions for optimality, ensuring that the results have well-defined numerical meaning, (c) allows the scale of the solution to be controlled exactly, and (d) is computationally efficient. We illustrate the use of our approach in the context of gene expression microarray data analysis. The improvements described remedy key limitations of previous proposals, strengthen the theoretical basis of regularized NMF, and facilitate the use of regularized NMF in applications.
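To make the setting concrete, here is a minimal sketch of an L1-penalized NMF in plain Python. It is not the rNMF algorithm described above (block-coordinate descent with proximal point modification); it swaps in the simpler, well-known multiplicative-update scheme. The objective 0.5*||A - WH||_F^2 + lam*sum(H), the symbols A, W, H and lam, and all function names are assumptions made for illustration only.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def objective(A, W, H, lam):
    """Assumed objective: 0.5 * ||A - WH||_F^2 + lam * (sum of entries of H)."""
    R = matmul(W, H)
    fit = sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))
    return 0.5 * fit + lam * sum(map(sum, H))

def sparse_nmf(A, k, lam=0.1, iters=200, seed=0, eps=1e-12):
    """Multiplicative updates for the assumed objective (not the paper's rNMF)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    # Strictly positive start: multiplicative updates cannot leave zero.
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + 0.1 for _ in range(n)] for _ in range(k)]
    history = [objective(A, W, H, lam)]
    for _ in range(iters):
        # H <- H * (W^T A) / (W^T W H + lam); lam pushes entries of H to zero.
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[h * p / (q + lam + eps) for h, p, q in zip(*rows)]
             for rows in zip(H, num, den)]
        # W <- W * (A H^T) / (W H H^T); W carries no penalty in this sketch.
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[w * p / (q + eps) for w, p, q in zip(*rows)]
             for rows in zip(W, num, den)]
        history.append(objective(A, W, H, lam))
    return W, H, history

if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 1.0, 2.0]]
    W, H, history = sparse_nmf(A, k=2, lam=0.05, iters=100)
    print("objective: first", history[0], "last", history[-1])
```

Because W*c and H/c give the same product WH for any c > 0, this sketch can lower the penalty simply by rescaling the factors. It has no scale control, which is the kind of limitation that point (c) of the abstract refers to.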
Year: 2012 PMID: 23133590 PMCID: PMC3487913 DOI: 10.1371/journal.pone.0046331
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Time (seconds) needed to complete one update of all coordinates and to reach convergence in sets of gene expression data from blood disorders.
| Data set, reference | Data size | rNMF: iteration (s) | rNMF: convergence (s) | Control: iteration (s) | Control: convergence (s) |
| Acute Myeloid Leukemia | 22283×293 | 0.75 | 21.7 | 2.2 | 219.4 |
| Acute Myeloid Leukemia | 54613×461 | 3.95 | 128.8 | 10.2 | >600 |
| Acute Myeloid Leukemia | 44692×162 | 0.96 | 17.3 | 1.5 | 163.6 |
| Acute Lymphoblastic Leukemia | 22215×288 | 0.94 | 17.8 | 2.3 | 245.7 |
| Multiple Myeloma | 54613×320 | 3.04 | 29.1 | 6.4 | >600 |
All methods were implemented in C++ and identically initialized. Timings obtained on a 2.30 GHz Intel Core i7 2820QM CPU with 16 GB RAM. For convergence, we required a relative decrease in the objective function less than 10 in successive iterations. Throughout, and .
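The stopping rule described in this note (stop when the relative decrease of the objective between successive iterations falls below a threshold) can be sketched as follows. This is an illustrative Python sketch, not the authors' C++ code: the names `has_converged` and `iterate_until_converged`, the `max_iter` cap, and the tolerance `tol` are hypothetical, and `tol` is a placeholder parameter rather than the value used in the paper.

```python
def has_converged(prev_obj, curr_obj, tol):
    """True when the relative decrease of the objective is below tol."""
    if prev_obj == 0:
        return True  # objective already at zero, nothing left to decrease
    return (prev_obj - curr_obj) / abs(prev_obj) < tol

def iterate_until_converged(step, objective, state, tol, max_iter=10_000):
    """Apply step() repeatedly until the relative-decrease rule fires.

    step and objective are caller-supplied callables (hypothetical
    interface): step maps a state to the next state, objective maps a
    state to a number. Returns the final state and the iteration count.
    """
    prev = objective(state)
    for it in range(1, max_iter + 1):
        state = step(state)
        curr = objective(state)
        if has_converged(prev, curr, tol):
            return state, it
        prev = curr
    return state, max_iter

if __name__ == "__main__":
    # Toy problem: the state halves its distance to 1 on every step.
    state, n_iter = iterate_until_converged(
        step=lambda x: 1 + (x - 1) / 2,
        objective=lambda x: x,
        state=3.0,
        tol=0.09,
    )
    print(state, n_iter)
```

A relative criterion of this kind does not depend on the scale of the data, which is what makes a single threshold usable across data sets of different sizes.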
Figure 1. Convergence of rNMF on real data.
Left: The objective function decreases faster with rNMF (blue) than the control method (dashed). We standardized the objective function by dividing it by the squared Frobenius norm of . Right: As predicted theoretically, rNMF closes the KKT conditions ( axis indicates the negative logarithm of the max-norm of the KKT condition matrix for , that is which should approach the zero matrix). The results in this figure were obtained for gene expression profiles of Acute Myeloid Leukemia [36], = 10, and set to yield about 50% sparsity. This example is representative as similar results were obtained for other data sets and parameter choices.
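The two diagnostics plotted in this figure can be sketched in Python under explicit assumptions, since the symbols are not given in the caption as it appears here. The sketch assumes the objective 0.5*||A - WH||_F^2 + lam*sum(H), standardizes it by the squared Frobenius norm of the data matrix (called A here), and takes the KKT condition matrix to be the element-wise product of H with the gradient with respect to H (the complementary-slackness condition). The base-10 logarithm and the name `kkt_diagnostics` are also assumptions.

```python
import math

def matmul(X, Y):
    """Plain-Python matrix product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def kkt_diagnostics(A, W, H, lam=0.0):
    """Return (standardized objective, max-norm of H*grad, -log10 of that norm).

    Assumed objective: 0.5 * ||A - WH||_F^2 + lam * sum(H).
    Assumed KKT matrix: element-wise product of H and the gradient in H.
    """
    WH = matmul(W, H)
    R = [[r - a for r, a in zip(rr, ra)] for rr, ra in zip(WH, A)]  # WH - A
    Wt = [list(col) for col in zip(*W)]
    # Gradient with respect to H: W^T (WH - A) + lam
    G = [[g + lam for g in row] for row in matmul(Wt, R)]
    max_norm = max(abs(h * g) for hr, gr in zip(H, G) for h, g in zip(hr, gr))
    fit = 0.5 * sum(r * r for row in R for r in row)
    penalty = lam * sum(map(sum, H))
    norm_sq = sum(a * a for row in A for a in row)
    std_obj = (fit + penalty) / norm_sq
    neg_log = -math.log10(max_norm) if max_norm > 0 else math.inf
    return std_obj, max_norm, neg_log

if __name__ == "__main__":
    A = [[2.0, 4.0], [1.0, 2.0]]
    W = [[1.0], [0.5]]
    H = [[1.5, 3.5]]
    print(kkt_diagnostics(A, W, H, lam=0.1))
```

With these definitions, a larger value of the negative logarithm means the product of H and its gradient is closer to the zero matrix, that is, closer to satisfying this optimality condition.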
Figure 2. Application to gene expression microarray data from blood disorders.
Columns indicate components, rows classes of blood cells. Blue cells indicate significant enrichment of cell type-specific markers (as detected by gene set enrichment testing; ) in the component generated by rNMF with 90% sparsity (a) and conventional NMF (b). The components have been ordered by strength (defined as norm of ) with denoting the strongest component. As discussed in detail in Results, strong components generated by rNMF capture cell type-related gene expression features more clearly than conventional NMF.