Kazem Nasserinejad, Joost van Rosmalen, Wim de Kort, Emmanuel Lesaffre.
Abstract
Identifying the number of classes in Bayesian finite mixture models is a challenging problem. Several criteria have been proposed, such as adaptations of the deviance information criterion, marginal likelihoods, Bayes factors, and reversible jump MCMC techniques. It was recently shown that in overfitted mixture models, the overfitted latent classes will asymptotically become empty under specific conditions for the prior of the class proportions. This result may be used to construct a criterion for finding the true number of latent classes, based on the removal of latent classes that have negligible proportions. Unlike some alternative criteria, this criterion can easily be implemented in complex statistical models such as latent class mixed-effects models and multivariate mixture models using standard Bayesian software. We performed an extensive simulation study to develop practical guidelines to determine the appropriate number of latent classes based on the posterior distribution of the class proportions, and to compare this criterion with alternative criteria. The performance of the proposed criterion is illustrated using a data set of repeatedly measured hemoglobin values of blood donors.
Year: 2017 PMID: 28081166 PMCID: PMC5231325 DOI: 10.1371/journal.pone.0168838
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
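The criterion described in the abstract (fit an overfitted mixture, then discard latent classes with negligible proportions) can be illustrated with a minimal sketch. It assumes that posterior draws of the class proportions are already available from some Bayesian software, and it summarizes the number of non-empty classes by its posterior mode; the function names, the tie-breaking rule, the example draws, and the use of the mode are assumptions of this sketch, not details taken from the paper.

```python
from collections import Counter


def count_nonempty_classes(draws, psi):
    """For each posterior draw of the class proportions,
    count the classes whose proportion exceeds the cut-off psi."""
    return [sum(1 for p in draw if p > psi) for draw in draws]


def estimate_num_classes(draws, psi):
    """Return the most frequent number of non-empty classes across draws.

    Ties are broken in favour of the smaller number of classes
    (an assumption made for this sketch).
    """
    counts = Counter(count_nonempty_classes(draws, psi))
    return min(counts, key=lambda k: (-counts[k], k))


# Hypothetical posterior draws from a model fitted with 4 classes.
draws = [
    [0.52, 0.46, 0.01, 0.01],
    [0.49, 0.48, 0.02, 0.01],
    [0.50, 0.44, 0.05, 0.01],
    [0.55, 0.43, 0.01, 0.01],
]

print(estimate_num_classes(draws, psi=0.03))
```

With these illustrative draws, three of the four draws have two classes above the cut-off, so the sketch reports two classes. A smaller cut-off keeps more classes, which is why the result is examined for several values of ψ.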
Fig 1. Univariate simulated data study.
Histograms of randomly selected generated data sets. The solid lines represent the true marginal densities.
The results of Scenario A1.
Percentage of data sets in which the true number of clusters was found, with the mode of the estimated number of classes in parentheses.
| α | RJMCMC | | | | | | | | | DIC3 | DIC4 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00001 | — | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 0%(5) | 0%(5) |
| 0.001 | 18%(10) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 0%(3) | 0%(3) |
| 0.01 | 28%(1) | 98%(1) | 98%(1) | 98%(1) | 98%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 20%(5) |
| 0.05 | 90%(1) | 22%(2) | 80%(1) | 84%(1) | 92%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 72%(1) |
| 0.1 | 98%(1) | 2%(4) | 10%(3) | 18%(2) | 40%(2) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) | 100%(1) |
| 0.3 | 98%(1) | 0%(8) | 0%(6) | 0%(5) | 0%(3) | 98%(1) | 100%(1) | 100%(1) | 100%(1) | 98%(1) | 100%(1) |
| 0.5 | 98%(1) | 0%(9) | 0%(7) | 0%(6) | 0%(5) | 96%(1) | 98%(1) | 98%(1) | 100%(1) | 96%(1) | 100%(1) |
| 0.9 | 98%(1) | 0%(10) | 0%(9) | 0%(8) | 0%(6) | 96%(1) | 98%(1) | 98%(1) | 100%(1) | 94%(1) | 100%(1) |
The success rate of BIC using a frequentist approach was 100%.
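Each table cell combines two summaries over the simulated data sets: the percentage in which the true number of classes was recovered, and the most frequent estimate. A minimal sketch of how such a cell could be computed is given below; the function name, the tie-breaking rule, and the example estimates (50 hypothetical data sets) are assumptions for illustration only.

```python
from collections import Counter


def summarize(estimates, true_k):
    """Format the success rate and the modal estimate as 'rate%(mode)'."""
    n = len(estimates)
    success = 100 * sum(1 for k in estimates if k == true_k) / n
    counts = Counter(estimates)
    # Most frequent estimate; ties go to the smaller value (an assumption).
    mode = min(counts, key=lambda k: (-counts[k], k))
    return f"{success:.0f}%({mode})"


# Hypothetical estimates: 49 data sets give 1 class, one gives 2 classes.
estimates = [1] * 49 + [2]
print(summarize(estimates, true_k=1))  # prints 98%(1)
```

Note that the mode can differ from the true number of classes, in which case the success rate is low and the value in parentheses shows which wrong answer was most common.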
The results of Scenarios A2–A4.
Percentage of data sets in which the true number of clusters was found, with the mode of the estimated number of classes in parentheses.
| Scenario | α | RJMCMC | | | | | | | | | DIC3 | DIC4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario A2 | 0.00001 | — | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 8%(5) | 68%(3) |
| | 0.001 | 6%(8) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 6%(5) | 68%(3) |
| | 0.01 | 16%(4) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 24%(5) | 54%(3) |
| | 0.05 | 54%(3) | 0%(4) | 84%(3) | 98%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 94%(3) | 52%(3) |
| | 0.1 | 94%(3) | 0%(5) | 0%(4) | 12%(4) | 86%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 68%(3) |
| | 0.3 | 100%(3) | 0%(8) | 0%(6) | 0%(5) | 0%(4) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 50%(3) |
| | 0.5 | 100%(3) | 0%(9) | 0%(8) | 0%(6) | 0%(5) | 98%(3) | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 64%(3) |
| | 0.9 | 100%(3) | 0%(10) | 0%(9) | 0%(8) | 0%(6) | 6%(4) | 80%(3) | 92%(3) | 98%(3) | 100%(3) | 94%(3) |
| Scenario A3 | 0.00001 | — | 4%(2) | 4%(2) | 4%(2) | 4%(2) | 0%(2) | 0%(2) | 0%(2) | 0%(2) | 20%(5) | 10%(1) |
| | 0.001 | 10%(10) | 6%(2) | 6%(2) | 6%(2) | 6%(2) | 2%(2) | 2%(2) | 2%(2) | 2%(2) | 18%(5) | 8%(1) |
| | 0.01 | 16%(2) | 36%(2) | 34%(2) | 34%(2) | 34%(2) | 2%(2) | 2%(2) | 2%(2) | 2%(2) | 46%(3) | 6%(1) |
| | 0.05 | 2%(2) | 38%(4) | 88%(3) | 86%(3) | 74%(3) | 2%(2) | 2%(2) | 2%(2) | 2%(2) | 28%(2) | 8%(1) |
| | 0.1 | 4%(2) | 0%(5) | 4%(4) | 32%(3) | 94%(3) | 4%(2) | 4%(2) | 2%(2) | 2%(2) | 8%(2) | 8%(1) |
| | 0.3 | 6%(2) | 0%(8) | 0%(6) | 0%(4) | 0%(4) | 28%(2) | 28%(2) | 28%(2) | 28%(2) | 2%(2) | 4%(2) |
| | 0.5 | 8%(2) | 0%(9) | 0%(8) | 0%(7) | 0%(5) | 60%(3) | 48%(2) | 46%(2) | 46%(2) | 4%(2) | 4%(5) |
| | 0.9 | 10%(2) | 0%(10) | 0%(9) | 0%(8) | 0%(6) | 0%(4) | 50%(3) | 66%(3) | 94%(3) | 10%(2) | 8%(2) |
| Scenario A4 | 0.00001 | — | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(5) | 8%(5) |
| | 0.001 | 8%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 78%(3) | 46%(3) |
| | 0.01 | 6%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 4%(1) | 34%(3) |
| | 0.05 | 0%(1) | 46%(3) | 2%(2) | 0%(2) | 0%(2) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 4%(1) | 0%(1) |
| | 0.1 | 0%(1) | 2%(4) | 80%(3) | 64%(3) | 16%(2) | 0%(1) | 0%(1) | 0%(1) | 0%(1) | 6%(1) | 0%(1) |
| | 0.3 | 0%(1) | 0%(8) | 0%(6) | 0%(5) | 8%(4) | 0%(2) | 0%(1) | 0%(1) | 0%(1) | 10%(1) | 2%(1) |
| | 0.5 | 0%(1) | 0%(9) | 0%(7) | 0%(6) | 0%(5) | 62%(3) | 0%(2) | 0%(2) | 0%(2) | 14%(1) | 0%(1) |
| | 0.9 | 0%(1) | 0%(10) | 0%(9) | 0%(8) | 0%(6) | 0%(5) | 6%(4) | 44%(4) | 98%(3) | 12%(2) | 0%(1) |
The success rates of BIC using a frequentist approach for high, moderate, and low levels of separation were 100%(3), 16%(2), and 0%(1), respectively.
Unequal proportions heterogeneous scenario; a heterogeneous population with three clusters.
λ1 = 0.475, λ2 = 0.475, λ3 = 0.05, μ1 = 1, μ2 = 2, μ3 = 3 and σ1 = σ2 = σ3 = 0.25 (high separation). Percentage of data sets in which the true number of clusters was found, with the mode of the estimated number of classes in parentheses.
| α | RJMCMC | | | | | DIC3 | DIC4 |
|---|---|---|---|---|---|---|---|
| 0.00001 | — | 100%(3) | 100%(3) | 100%(3) | 48%(2) | 0%(5) | 14%(4) |
| 0.001 | 6%(7) | 98%(3) | 98%(3) | 98%(3) | 48%(2) | 2%(5) | 14%(4) |
| 0.01 | 36%(4) | 98%(3) | 98%(3) | 98%(3) | 50%(3) | 40%(3) | 24%(4) |
| 0.05 | 70%(3) | 100%(3) | 100%(3) | 100%(3) | 50%(3) | 78%(3) | 16%(4) |
| 0.1 | 98%(3) | 100%(3) | 100%(3) | 100%(3) | 50%(3) | 98%(3) | 20%(4) |
| 0.3 | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 54%(3) | 98%(3) | 28%(4) |
| 0.5 | 100%(3) | 100%(3) | 100%(3) | 100%(3) | 56%(3) | 100%(3) | 44%(3) |
| 0.9 | 100%(3) | 20%(4) | 88%(3) | 96%(3) | 80%(3) | 100%(3) | 92%(3) |
The success rate of BIC using a frequentist approach was 98%(3).
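To make the unequal-proportions scenario concrete, the sketch below draws one data set from a three-component normal mixture with the stated proportions (0.475, 0.475, 0.05), means (1, 2, 3), and common standard deviation 0.25. The sample size of 200, the seed, and the function name are assumptions of this sketch; the sample size used in the study is not given in this section.

```python
import random


def simulate_mixture(n, weights, means, sds, seed=0):
    """Draw n values from a univariate normal mixture.

    Returns the latent class labels and the observed values.
    """
    rng = random.Random(seed)
    # Sample the latent class of each observation from the proportions.
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    # Sample each observation from the normal density of its class.
    values = [rng.gauss(means[z], sds[z]) for z in labels]
    return labels, values


# Parameters from the scenario description; n = 200 is an assumed sample size.
labels, values = simulate_mixture(
    n=200,
    weights=[0.475, 0.475, 0.05],
    means=[1, 2, 3],
    sds=[0.25, 0.25, 0.25],
)

# Observed class sizes; the third class is expected to be small.
print([labels.count(k) for k in range(3)])
```

Because the third class has a true proportion of only 0.05, a cut-off at or above that value would tend to discard it, which is relevant when reading the column in which the modal estimate drops to two classes.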
Fig 2. Longitudinal simulated data study.
The left profile belongs to a homogeneous population with one class. The middle one belongs to a population with three classes where classes differ only in intercept, and the right profile belongs to a heterogeneous population with three classes where classes differ both in intercept and slope.
The results of Scenario B2.
Percentage of data sets in which the true number of clusters was found, with the mode of the estimated number of classes in parentheses.
| α | | | | |
|---|---|---|---|---|
| 0.00001 | 4%(1) | 4%(1) | 4%(1) | 4%(1) |
| 0.001 | 4%(2) | 4%(2) | 4%(2) | 4%(2) |
| 0.01 | 18%(2) | 18%(2) | 18%(2) | 18%(2) |
| 0.05 | 48%(2) | 48%(2) | 48%(2) | 48%(2) |
| 0.1 | 74%(3) | 74%(3) | 74%(3) | 74%(3) |
| 0.3 | 90%(3) | 90%(3) | 90%(3) | 90%(3) |
| 0.5 | 96%(3) | 98%(3) | 98%(3) | 100%(3) |
| 0.9 | 8%(4) | 14%(4) | 20%(4) | 34%(4) |
The success rate of BIC using a frequentist approach was 98%(3).
The results of Scenario B3.
Percentage of data sets in which the true number of clusters was found, with the mode of the estimated number of classes in parentheses.
| α | | | | |
|---|---|---|---|---|
| 0.00001 | 4%(2) | 4%(2) | 4%(2) | 4%(2) |
| 0.001 | 4%(2) | 4%(2) | 4%(2) | 4%(2) |
| 0.01 | 6%(2) | 6%(2) | 6%(2) | 6%(2) |
| 0.05 | 6%(2) | 6%(2) | 6%(2) | 6%(2) |
| 0.1 | 6%(2) | 6%(2) | 6%(2) | 6%(2) |
| 0.3 | 10%(2) | 10%(2) | 10%(2) | 10%(2) |
| 0.5 | 10%(2) | 10%(2) | 10%(2) | 10%(2) |
| 1.0 | 22%(2) | 22%(2) | 22%(2) | 20%(2) |
| 1.5 | 36%(2) | 36%(2) | 36%(2) | 32%(2) |
| 2.0 | 58%(3) | 58%(3) | 58%(3) | 52%(3) |
| 2.5 | 36%(3) | 36%(3) | 36%(3) | 34%(4) |
The success rate of BIC using a frequentist approach was 46%(3).
Number of latent classes in Hb data for different α and different cut-offs (ψ).
| α | | | | |
|---|---|---|---|---|
| 0.5 | 1 | 1 | 1 | 1 |
| 1.0 | 1 | 1 | 1 | 1 |
| 1.5 | 2 | 2 | 2 | 2 |
| 2.0 | 4 | 4 | 4 | 3 |
| 2.5 | 4 | 4 | 4 | 3 |
BIC using a frequentist approach found 2 classes.
Fig 3. Hb profiles for four different classes.
Fig 4. Posterior distribution of non-empty classes (K) for different cut-offs (ψ).