Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter.
Abstract
In model-based clustering, the Galaxy data set is often used as a benchmark data set to study the performance of different modeling approaches. Aitkin (Stat Model 1:287-304) compares maximum likelihood and Bayesian analyses of the Galaxy data set and expresses reservations about the Bayesian approach because the prior assumptions imposed remain rather obscure while playing a major role in the results obtained and conclusions drawn. The aim of the paper is to address Aitkin's concerns about the Bayesian approach by shedding light on how the specified priors influence the number of estimated clusters. We perform a sensitivity analysis of different prior specifications for the mixture of finite mixtures (MFM) model, i.e., the mixture model where a prior on the number of components is included. We use an extensive set of different prior specifications in a full factorial design and assess their impact on the estimated number of clusters for the Galaxy data set. Results highlight the interaction effects of the prior specifications and provide insights into which prior specifications are recommended to obtain a sparse clustering solution. A simulation study with artificial data provides further empirical evidence to support the recommendations. A clear understanding of the impact of the prior specifications removes restraints preventing the use of Bayesian methods due to the complexity of selecting suitable priors. Also, the regularizing properties of the priors may be intentionally exploited to obtain a suitable clustering solution meeting prior expectations and needs of the application.
Keywords: Bayes; Cluster analysis; Galaxy data set; Mixture model; Prior specification
Year: 2021 PMID: 35726283 PMCID: PMC9203419 DOI: 10.1007/s11634-021-00461-8
Source DB: PubMed Journal: Adv Data Anal Classif ISSN: 1862-5355
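Fig. 1. The prior probabilities of K (in blue) and of the number of data clusters (in red) for the static MFM, for different priors on K and different values of the Dirichlet parameter
Fig. 2. The prior probabilities of K (in blue) and of the number of data clusters (in red) for the dynamic MFM, for different priors on K and different values of the Dirichlet parameter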
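The difference between the prior on the number of components K and the induced prior on the number of data clusters (non-empty components) can be approximated by simulation. The following is a minimal sketch, not the authors' code. It assumes that the static MFM uses symmetric Dirichlet weights with a fixed parameter, that the dynamic MFM divides that parameter by K, that K follows the U(1, 30) prior listed in the Galaxy table, and that N = 82 (the size of the Galaxy data set). All function names are hypothetical.

```python
import random
from collections import Counter


def draw_k_plus(rng, n_obs, k, concentration):
    """Allocate n_obs observations to k components and count the non-empty ones.

    Uses a Polya urn, i.e. the symmetric Dirichlet(concentration) weights are
    integrated out, which avoids numerical underflow for tiny concentrations.
    """
    counts = [0] * k
    for _ in range(n_obs):
        weights = [n_j + concentration for n_j in counts]
        j = rng.choices(range(k), weights=weights)[0]
        counts[j] += 1
    return sum(1 for n_j in counts if n_j > 0)


def simulate_k_plus_prior(rng, n_obs, n_draws, parameter, dynamic, k_max=30):
    """Monte Carlo estimate of the induced prior on the number of data clusters.

    Assumptions: K ~ U(1, k_max); static MFM uses Dirichlet(parameter),
    dynamic MFM uses Dirichlet(parameter / K).
    """
    tally = Counter()
    for _ in range(n_draws):
        k = rng.randint(1, k_max)  # draw the number of components
        concentration = parameter / k if dynamic else parameter
        tally[draw_k_plus(rng, n_obs, k, concentration)] += 1
    return {k_plus: c / n_draws for k_plus, c in sorted(tally.items())}


def mean_of_pmf(pmf):
    return sum(value * prob for value, prob in pmf.items())


if __name__ == "__main__":
    static = simulate_k_plus_prior(random.Random(1), 82, 2000, 1.0, dynamic=False)
    dynamic = simulate_k_plus_prior(random.Random(1), 82, 2000, 1.0, dynamic=True)
    print("static  E[K+] ~", round(mean_of_pmf(static), 2))
    print("dynamic E[K+] ~", round(mean_of_pmf(dynamic), 2))
```

Under these assumptions the dynamic specification places far more prior mass on small numbers of data clusters than the static one for the same parameter value, because the per-component Dirichlet parameter shrinks as K grows.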
Fig. 3. The prior distributions for the component means, with the prior mean equal to the data midpoint and four different prior variances, represented by the blue, purple, green and red lines, respectively, together with a histogram of the Galaxy data set
Fig. 4. The prior distributions induced by the prior on the component precisions under four different parameter settings, represented by the blue, purple, green and red lines, respectively, together with a histogram of the Galaxy data set
Galaxy data set. Average number of estimated data clusters, based on the mode, marginally for each of the different prior specifications
| MFM | Avg. | Dirichlet parameter | Avg. | Prior on K | Avg. | Prior on component means | Avg. | Prior on component precisions | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Static | 5.89 | 0.01 | 2.98 | trPois(3) | 3.99 | 6.3 | 5.39 | 0.5 | 6.93 |
| Dynamic | 4.70 | 1 | 5.56 | | 4.35 | 20 | 6.69 | 1 | 6.21 |
| | | 10 | 7.33 | Geom(0.1) | 6.00 | 100 | 5.20 | 5 | 4.53 |
| | | | | U(1, 30) | 6.82 | 630 | 3.90 | 12.5 | 3.50 |
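The factor levels in the table above span the full factorial design mentioned in the abstract: 2 × 3 × 4 × 4 × 4 = 384 prior settings. The sketch below enumerates such a design and computes marginal averages of a per-setting result. The dictionary keys are descriptive labels chosen for this sketch, and the assignment of the numeric levels to the priors on the component means and precisions is an assumption. One prior on K appears without a label in the table, so a placeholder string stands in for it. The result values used in the usage lines are made up for illustration only.

```python
from collections import defaultdict
from itertools import product

# Factor levels as they appear in the Galaxy table. Key names are
# illustrative labels, not the authors' notation (assumption).
FACTORS = {
    "mfm": ["static", "dynamic"],
    "dirichlet_parameter": [0.01, 1, 10],
    # The second prior on K has no label in the source; placeholder only.
    "prior_on_K": ["trPois(3)", "<unlabeled in source>", "Geom(0.1)", "U(1, 30)"],
    "component_means_prior": [6.3, 20, 100, 630],
    "component_precisions_prior": [0.5, 1, 5, 12.5],
}


def full_factorial(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def marginal_means(results, factor):
    """Average the result over all other factors, per level of `factor`.

    `results` is an iterable of (setting, value) pairs, where `setting`
    is one of the dicts produced by `full_factorial`.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for setting, value in results:
        level = setting[factor]
        sums[level] += value
        counts[level] += 1
    return {level: sums[level] / counts[level] for level in sums}


if __name__ == "__main__":
    design = full_factorial(FACTORS)
    print(len(design))  # number of prior settings in the design
    # Hypothetical results: a constant per MFM type, for illustration only.
    fake = [(s, 1.0 if s["mfm"] == "static" else 3.0) for s in design]
    print(marginal_means(fake, "mfm"))
```

Averaging marginally in this way hides interactions between factors, which is why the per-setting figures are needed in addition to the table.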
Fig. 5. Galaxy data set. Estimated number of data clusters, based on the mode, for different prior specifications. The rows show the results for the static and dynamic MFM, the columns those for the different values of the Dirichlet parameter
Fig. 6. Galaxy data set. Entropy of the posterior of the number of data clusters for different prior specifications. The rows show the results for the static and dynamic MFM, the columns those for the different values of the Dirichlet parameter
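Both summaries used in these figures, the posterior mode and the entropy of the posterior, can be computed directly from posterior draws of the number of data clusters. The following is a minimal sketch with hypothetical function names. It assumes the natural logarithm for the entropy (the record does not state the base) and breaks ties in the mode toward the smaller value, which is also an assumption.

```python
import math
from collections import Counter


def posterior_pmf(draws):
    """Relative frequencies of the sampled values, e.g. MCMC draws."""
    counts = Counter(draws)
    total = len(draws)
    return {value: count / total for value, count in sorted(counts.items())}


def posterior_mode(pmf):
    """Most probable value; ties go to the smaller value (assumption)."""
    return min(pmf, key=lambda value: (-pmf[value], value))


def entropy(pmf):
    """Shannon entropy in nats; zero when all mass sits on one value."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


if __name__ == "__main__":
    draws = [3, 3, 3, 4, 4, 5]  # made-up draws for illustration
    pmf = posterior_pmf(draws)
    print(posterior_mode(pmf), round(entropy(pmf), 4))
```

A low entropy indicates that the posterior concentrates on few values, so the mode is a stable point estimate. A high entropy signals that the mode alone understates the uncertainty about the number of data clusters.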
Artificial data, maximum likelihood estimation with the BIC. Results are shown for three modeling approaches for the variances of the component distributions: equal, unequal, and either equal or unequal. The estimated numbers of components are summarized over 100 data sets by the minimum, the 25%, 50% and 75% quantiles and the maximum, given in square brackets
| Component distribution | Sample size | Equal | Unequal | Equal or unequal |
|---|---|---|---|---|
| Gaussian | 100 | [4.0, 4.0, 5.0, 5.0, 7.0] | [4.0, 4.0, 4.0, 4.0, 5.0] | [4.0, 4.0, 4.0, 4.0, 5.0] |
| | 1000 | [6.0, 7.0, 9.0, 9.0, 12.0] | [4.0, 4.0, 4.0, 4.0, 4.0] | [4.0, 4.0, 4.0, 4.0, 4.0] |
| Uniform | 100 | [4.0, 5.0, 5.0, 6.0, 8.0] | [3.0, 4.0, 4.0, 5.0, 7.0] | [3.0, 4.0, 5.0, 5.0, 8.0] |
| | 1000 | [7.0, 8.8, 9.0, 9.0, 15.0] | [5.0, 6.0, 7.0, 7.0, 9.0] | [5.0, 6.0, 7.0, 7.0, 9.0] |
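Selecting the number of components with the BIC trades the maximized log-likelihood against the number of free parameters, which differs between the equal-variance and unequal-variance models. The sketch below is one common formulation for univariate Gaussian mixtures, not necessarily the software or sign convention used by the authors. It assumes the form BIC = -2 log L + p log N, to be minimized (some packages report the negative of this and maximize it). The log-likelihood values in the usage lines are hypothetical.

```python
import math


def n_free_parameters(k, equal_variances):
    """Free parameters of a univariate Gaussian mixture with k components."""
    weights = k - 1                         # weights sum to one
    means = k
    variances = 1 if equal_variances else k
    return weights + means + variances


def bic(log_likelihood, k, n_obs, equal_variances):
    """BIC in the 'smaller is better' convention (assumption)."""
    penalty = n_free_parameters(k, equal_variances) * math.log(n_obs)
    return -2.0 * log_likelihood + penalty


def select_k(log_likelihoods, n_obs, equal_variances):
    """Pick the k with minimal BIC; `log_likelihoods` maps k to max log L."""
    return min(
        log_likelihoods,
        key=lambda k: bic(log_likelihoods[k], k, n_obs, equal_variances),
    )


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods for k = 1, 2, 3.
    fitted = {1: -250.0, 2: -220.0, 3: -218.0}
    print(select_k(fitted, n_obs=100, equal_variances=False))
```

Because the penalty grows with log N while the log-likelihood gain from extra components grows with N, a misspecified model (for example, Gaussian components fitted to uniform clusters) can select more components as the sample size increases, which is consistent with the pattern in the table.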
Artificial data, Bayesian estimation. Average number of estimated data clusters, based on the mode, marginally for each of the different prior specifications, separately for Gaussian and uniform component distributions
| Component distribution | Sample size | Avg. | MFM | Avg. | Dirichlet parameter | Avg. | Prior on K | Avg. | Prior on component means | Avg. | Prior on component precisions | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gaussian | 100 | 5.12 | Static | 6.15 | 0.01 | 3.97 | trPois(3) | 4.69 | 6.3 | 6.34 | 0.5 | 6.65 |
| | 1000 | 5.87 | Dynamic | 4.83 | 1 | 5.25 | | 4.97 | 630 | 4.65 | 12.5 | 4.34 |
| | | | | | 10 | 7.26 | Geom(0.1) | 5.66 | | | | |
| | | | | | | | U(1, 100) | 6.66 | | | | |
| Uniform | 100 | 5.56 | Static | 7.33 | 0.01 | 4.85 | trPois(3) | 5.57 | 6.3 | 7.65 | 0.5 | 8.61 |
| | 1000 | 7.76 | Dynamic | 5.99 | 1 | 6.85 | | 6.27 | 630 | 5.67 | 12.5 | 4.71 |
| | | | | | 10 | 8.27 | Geom(0.1) | 6.84 | | | | |
| | | | | | | | U(1, 100) | 7.96 | | | | |
Fig. 7. Artificial data, dynamic MFM. Estimated number of data clusters, based on the mode, for 100 data sets with different specifications for the prior on K and for the priors on the component means and precisions. The rows show the results for the different sample sizes (100 or 1000), the columns those for the different data generating processes, mixtures of Gaussians or mixtures of uniform distributions. The results for the specifications listed in the legend are shown from left to right within each prior on K setting
Fig. 8. Artificial data, static MFM. Estimated number of data clusters, based on the mode, for 100 data sets with different specifications for the prior on K and for the priors on the component means and precisions. The rows show the results for the different sample sizes (100 or 1000), the columns those for the different data generating processes, mixtures of Gaussians or mixtures of uniform distributions. The results for the specifications listed in the legend are shown from left to right within each prior on K setting