Paul Fogel, Yann Gaston-Mathé, Douglas Hawkins, Fajwel Fogel, George Luta, S Stanley Young.
Abstract
Data can often be represented as a matrix, e.g., with observations as rows and variables as columns, or as a doubly classified contingency table. Researchers may be interested in clustering the observations, the variables, or both. If the data are non-negative, Non-negative Matrix Factorization (NMF) can be used to perform the clustering. By its nature, NMF-based clustering is focused on the large values. If the data are normalized by subtracting the row/column means, they take on mixed signs and the original NMF cannot be used. Our idea is to split and then concatenate the positive and negative parts of the matrix, after taking the absolute value of the negative elements. NMF applied to the concatenated data, which we call PosNegNMF, offers the advantages of the original NMF approach, while giving equal weight to large and small values. We use two public health datasets to illustrate the new method and compare it with alternative clustering methods, such as K-means and clustering methods based on the Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). With the exception of situations where a reasonably accurate factorization can be achieved using the first SVD component, we recommend that epidemiologists and environmental scientists use the new method to obtain clusters with improved quality and interpretability.
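The split-and-concatenate step described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the normalization subtracts column means only, the two parts are concatenated side by side (column-wise), and the function names `center_columns` and `split_pos_neg` and the toy `counts` matrix are hypothetical. The NMF step itself is not shown; any standard NMF routine could be applied to the resulting non-negative matrix.

```python
def center_columns(matrix):
    """Subtract each column's mean (an assumed choice of normalization)."""
    n_rows = len(matrix)
    col_means = [sum(col) / n_rows for col in zip(*matrix)]
    return [[x - m for x, m in zip(row, col_means)] for row in matrix]


def split_pos_neg(matrix):
    """Concatenate the positive part and the absolute value of the
    negative part of each row, side by side (assumed layout)."""
    stacked = []
    for row in matrix:
        pos = [max(x, 0.0) for x in row]    # positive part, zeros elsewhere
        neg = [max(-x, 0.0) for x in row]   # |negative part|, zeros elsewhere
        stacked.append(pos + neg)
    return stacked


# Toy non-negative data: 3 observations (rows) by 2 variables (columns).
counts = [[10.0, 2.0], [4.0, 6.0], [1.0, 7.0]]

centered = center_columns(counts)   # mixed signs, so plain NMF does not apply
stacked = split_pos_neg(centered)   # non-negative, twice as many columns
```

In this layout each original variable contributes two columns, one carrying its above-mean values and one carrying the magnitudes of its below-mean values, and the centered matrix can be recovered by subtracting the second half of each row from the first.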
Keywords: K-means; NMF; PCA; SVD
Year: 2016 PMID: 27213413 PMCID: PMC4881134 DOI: 10.3390/ijerph13050509
Source DB: PubMed Journal: Int J Environ Res Public Health ISSN: 1660-4601 Impact factor: 3.390
Figure 1. (a) Residual sum of squares; (b) Row clustering stability.
Figure 2. NMF clustering and re-ordering of hospital admissions by city and cause. Red: High count; Blue: Low count.
High and low counts by cluster.
| Cluster | High Counts | Low Counts |
|---|---|---|
| 1 | Respiratory | CVD, CHF, MI |
| 2 | MI | CHF, diabetes |
| 3 | MI, CVD | Respiratory |
| 4 | CHF | Diabetes, MI |
Abbreviations: cardiovascular disease (CVD), myocardial infarction (MI), congestive heart failure (CHF).
Figure 3. (a) Residual sum of squares; (b) Clustering stability.
Figure 4. Affine NMF clustering. Red: High count; Blue: Low count.
Figure 5. Specific clustering contribution of NMF clusters, PosNegNMF and affine NMF approaches.
Figure 6. Correspondence analysis biplot of hospital admissions by city and cause (PosNegNMF clusters are represented by the city label colors).
Figure 7. SVD reordering of the rows and columns of a life table. Red: High count; Blue: Low count.