Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter, Bettina Grün.
Abstract
The use of a finite mixture of normal distributions in model-based clustering allows us to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and in general either achieved by imposing constraints on the model or by using post-processing procedures. Within the Bayesian framework, we propose a different approach based on sparse finite mixtures to achieve identifiability. We specify a hierarchical prior, where the hyperparameters are carefully selected such that they are reflective of the cluster structure aimed at. In addition, this prior allows us to estimate the model using standard MCMC sampling methods. In combination with a post-processing approach which resolves the label switching issue and results in an identified model, our approach allows us to simultaneously (1) determine the number of clusters, (2) flexibly approximate the cluster distributions in a semiparametric way using finite mixtures of normals and (3) identify cluster-specific parameters and classify observations. The proposed approach is illustrated in two simulation studies and on benchmark datasets. Supplementary materials for this article are available online.
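The core mechanism behind the sparse-finite-mixture idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a generic toy Gibbs sampler under assumed settings (unit component variances, hyperparameters `e0` and `tau2` chosen for illustration): an overfitted normal mixture with K = 10 components and a symmetric Dirichlet prior with a very small concentration parameter, so that superfluous components empty out and the number of non-empty components estimates the number of clusters.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: two well-separated unit-variance normal clusters.
y = np.concatenate([rng.normal(-4.0, 1.0, 100), rng.normal(4.0, 1.0, 100)])

K = 10       # deliberately overfitting number of components
e0 = 0.01    # sparse symmetric Dirichlet concentration (assumed value)
tau2 = 10.0  # prior variance of the component means (assumed value)

w = np.full(K, 1.0 / K)
mu = rng.normal(0.0, np.sqrt(tau2), K)
k_plus = []  # number of non-empty components per sweep

for sweep in range(500):
    # 1) Sample allocations via the Gumbel-max trick (unit variances assumed).
    logp = np.log(w)[None, :] - 0.5 * (y[:, None] - mu[None, :]) ** 2
    z = np.argmax(logp + rng.gumbel(size=logp.shape), axis=1)
    counts = np.bincount(z, minlength=K)
    # 2) Sample weights from the Dirichlet full conditional.
    w = rng.dirichlet(e0 + counts)
    # 3) Sample means from their conjugate normal full conditionals;
    #    empty components draw from the prior.
    for k in range(K):
        post_var = 1.0 / (1.0 / tau2 + counts[k])
        post_mean = post_var * y[z == k].sum()
        mu[k] = rng.normal(post_mean, np.sqrt(post_var))
    if sweep >= 200:  # discard burn-in
        k_plus.append(np.count_nonzero(counts))

# Posterior mode of the number of non-empty components.
k_hat = np.bincount(k_plus).argmax()
print("estimated number of clusters:", k_hat)
```

With the small concentration `e0`, the Dirichlet full conditional pushes the weights of unoccupied components toward zero, which is what lets an overfitted mixture report the number of filled components as a cluster-count estimate.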
Keywords: Bayesian nonparametric mixture model; Dirichlet prior; Finite mixture model; Model-based clustering; Normal gamma prior; Number of components
Year: 2017 PMID: 28626349 PMCID: PMC5455957 DOI: 10.1080/10618600.2016.1200472
Source DB: PubMed Journal: J Comput Graph Stat ISSN: 1061-8600 Impact factor: 2.302
Figure 1. Variance decomposition of a mixture distribution. Scatterplots of samples from a standard normal mixture distribution with three components and equal weights, with a varying amount of heterogeneity φ explained by the variation of the component means, φ = 0.1, φ = 0.5, and φ = 0.9 (from left to right).
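The decomposition in the caption can be sketched in one dimension under assumed settings (three equal-weight components, total variance 1): the component means carry variance φ and each component has variance 1 − φ, so by the law of total variance Var(Y) = φ + (1 − φ) = 1 for every φ.

```python
import numpy as np

def sample_mixture(phi, size, rng):
    # Means at c * (-1, 0, 1): with equal weights their variance is 2c²/3,
    # so c = sqrt(3φ/2) gives between-component variance exactly φ.
    means = np.sqrt(3.0 * phi / 2.0) * np.array([-1.0, 0.0, 1.0])
    comp = rng.integers(0, 3, size)  # equal weights 1/3
    return means[comp] + np.sqrt(1.0 - phi) * rng.standard_normal(size)

rng = np.random.default_rng(0)
for phi in (0.1, 0.5, 0.9):
    y = sample_mixture(phi, 100_000, rng)
    # Total variance stays at 1 while the split between within- and
    # between-component variation changes with φ.
    print(phi, round(y.mean(), 3), round(y.var(), 3))
```

Larger φ concentrates the heterogeneity in the component means, which is why the right-hand panel of the figure shows the most clearly separated clusters.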
Results for the estimated number of data clusters for various benchmark datasets, using the functions Mclust to fit a standard mixture model with K = 10 and clustCombi to estimate a mixture with combined components (column Mclust), using a sparse finite mixture model with K = 10 (column SparseMix), and estimating a sparse hierarchical mixture of mixtures model with K = 10, φ = 0.5 and φ = 0.1, and L = 4, 5 (column SparseMixMix). Priors and hyperparameter specifications are selected as described in Section 2. In parentheses, the adjusted Rand index (“1” corresponds to perfect classification) and the proportion of misclassified observations (“0” corresponds to perfect classification) are reported.
| Dataset | N | r | K | Mclust | SparseMix | SparseMixMix (φ = 0.5) | SparseMixMix (φ = 0.1) |
|---|---|---|---|---|---|---|---|
| Yeast | 626 | 3 | 2 | 8 | 6 | 6 | 2 |
| Flea beetles | 74 | 6 | 3 | 5 | 4 | 3 | 3 |
| AIS | 202 | 3 | 2 | 3 | 2 | 3 | 2 |
| Wisconsin | 569 | 3 | 2 | 4 | 4 | 4 | 2 |
| Flower | 400 | 2 | 4 | 6 | 4 | 5 | 4 |
Figure 2. Flow cytometry dataset DLBCL. Scatterplot of the clustering results.
Figure 3. Flow cytometry dataset GvHD. Scatterplot of two variables (“FSC,” “CD8”) (left-hand side), and heatmap of the clustering results from fitting a sparse hierarchical mixture of mixtures model (right-hand side). In the heatmap, each row represents the location of a six-dimensional cluster, and each column represents a particular marker. The red, white, and blue colors denote high, medium, and low expression, respectively.