Kiero Guerra-Peña, Zoilo Emilio García-Batista, Sarah Depaoli, Luis Eduardo Garrido.
Abstract
Growth Mixture Modeling (GMM) has gained great popularity in recent decades as a methodology for longitudinal data analysis. The usual assumption of normally distributed repeated measures has been shown to be problematic in real-life data applications. Namely, fitting a normal GMM to data that are even slightly skewed can lead to overselection of the number of latent classes. To ameliorate this unwanted result, GMM based on the skew t family of continuous distributions has been proposed. This family includes the normal, skew normal, t, and skew t distributions. This simulation study aims to determine the efficiency of selecting the "true" number of latent groups in GMM based on the skew t family of continuous distributions, using fit indices and likelihood ratio tests (LRTs). Results show that the skew t GMM was the only model considered whose fit index and LRT false positive rates stayed under the 0.05 cutoff value across sample sizes and for normal as well as skewed and kurtotic data. Simulation results are corroborated by a real educational data application example. These findings favor the development of practical guides on the benefits and risks of using GMM based on this family of distributions.
Year: 2020 PMID: 32302350 PMCID: PMC7164627 DOI: 10.1371/journal.pone.0231525
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Line plots for normal (skew 0 and kurtosis 0) versus nonnormal data (skew 1.6 and kurtosis 4).
Convergence rate by distribution, data condition and sample size (500 replications).
| N | Data condition | Normal | Skew Normal | Skew t | t |
|---|---|---|---|---|---|
| 50 | Normal | 492 (0.98) | 317 (0.63) | 433 (0.87) | 35 (0.07) |
| | S. Nonnormal | 484 (0.97) | 348 (0.70) | 417 (0.83) | 51 (0.10) |
| | Nonnormal | 485 (0.97) | 359 (0.72) | 406 (0.81) | 56 (0.11) |
| 200 | Normal | 488 (0.98) | 286 (0.57) | 450 (0.90) | 46 (0.09) |
| | S. Nonnormal | 498 (0.96) | 288 (0.58) | 432 (0.87) | 98 (0.20) |
| | Nonnormal | 481 (0.96) | 274 (0.55) | 432 (0.86) | 125 (0.25) |
| 800 | Normal | 488 (0.98) | 281 (0.56) | 470 (0.94) | 116 (0.23) |
| | S. Nonnormal | 483 (0.97) | 276 (0.55) | 459 (0.92) | 149 (0.30) |
| | Nonnormal | 491 (0.98) | 264 (0.53) | 464 (0.93) | 178 (0.36) |
| 3,200 | Normal | 470 (0.94) | 295 (0.59) | 473 (0.95) | 217 (0.43) |
| | S. Nonnormal | 500 (1.00) | 457 (0.91) | 489 (0.98) | 185 (0.37) |
| | Nonnormal | 500 (1.00) | 445 (0.89) | 484 (0.97) | 230 (0.46) |
Values are frequencies (and proportions) of convergence across replications. Each replication was allowed 1,000 iterations to converge. Data conditions are normal (s = k = 0), s. nonnormal (slightly nonnormal, s = 1, k = 2) and nonnormal (s = 1.6, k = 4), where s is skewness and k is kurtosis.
Time of convergence by distribution and sample size for 1-class and 2-class solutions (500 replications).
| N | Classes | Data condition | Normal | Skew Normal | Skew t | t |
|---|---|---|---|---|---|---|
| 50 | 1 | Normal | 0.43 | 0.27 | 0.45 | 0.18 |
| | | S. Nonnormal | 0.63 | 0.07 | 0.27 | 0.12 |
| | | Nonnormal | 0.45 | 0.57 | 0.25 | 0.12 |
| | 2 | Normal | 191.63 | 12.58 | 12.83 | 7.18 |
| | | S. Nonnormal | 43.68 | 1.30 | 9.00 | 2.90 |
| | | Nonnormal | 39.03 | 1.32 | 8.57 | 2.83 |
| 200 | 1 | Normal | 0.50 | 0.25 | 1.40 | 0.37 |
| | | S. Nonnormal | 0.83 | 0.08 | 0.72 | 0.27 |
| | | Nonnormal | 0.53 | 0.70 | 0.67 | 0.25 |
| | 2 | Normal | 35.23 | 0.25 | 1.40 | 0.37 |
| | | S. Nonnormal | 0.83 | 0.08 | 0.72 | 0.27 |
| | | Nonnormal | 41.93 | 2.22 | 24.23 | 8.03 |
| 800 | 1 | Normal | 3.20 | 0.42 | 4.97 | 1.18 |
| | | S. Nonnormal | 1.07 | 0.17 | 2.38 | 0.83 |
| | | Nonnormal | 0.63 | 1.57 | 2.47 | 0.85 |
| | 2 | Normal | 109.48 | 8.13 | 215.70 | 46.67 |
| | | S. Nonnormal | 68.20 | 5.77 | 80.90 | 23.13 |
| | | Nonnormal | 88.27 | 5.77 | 84.70 | 29.58 |
| 3,200 | 1 | Normal | 5.92 | 1.03 | 18.35 | 4.30 |
| | | S. Nonnormal | 1.58 | 0.47 | 8.63 | 3.15 |
| | | Nonnormal | 0.88 | 4.42 | 10.25 | 3.15 |
| | 2 | Normal | 271.03 | 25.08 | 631.22 | 148.68 |
| | | S. Nonnormal | 123.73 | 20.23 | 319.00 | 97.63 |
| | | Nonnormal | 59.57 | 21.00 | 334.02 | 122.53 |
Values are a range of minutes that the EM algorithm needed to converge for all replications. Each replication was allowed 1,000 iterations to converge.
Descriptive statistics and correlation matrix of reading scores for each of the four time points for ECLS-K database.
| Time | Mean | Variance | Skewness | Kurtosis | time 1 | time 2 | time 3 |
|---|---|---|---|---|---|---|---|
| 1 | 23.90 | 78.91 | |||||
| 2 | 34.43 | 120.18 | |||||
| 3 | 39.48 | 154.78 | |||||
| 4 | 57.40 | 172.51 |
N = 3856. Significant coefficients appear bolded.
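Skewness and kurtosis values like those reported per time point can be computed from central moments. The following is a minimal sketch, assuming simple moment-based estimators and excess kurtosis (so that a normal distribution scores 0, in line with the s = k = 0 convention used for the simulated data). The function name and the sample values are hypothetical, and the software used in the study may apply small-sample corrections that this sketch omits.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (divides by n)
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis


# Hypothetical scores, not taken from the ECLS-K data.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 2, 3, 5, 9]

for label, scores in (("symmetric", symmetric), ("right-skewed", right_skewed)):
    s, k = skewness_and_kurtosis(scores)
    print(f"{label}: skewness = {s:.2f}, excess kurtosis = {k:.2f}")
```

A positive skewness indicates a longer right tail, which is the kind of departure from normality the simulation conditions vary.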
Fig 2. Histogram for reading scores for each time point.
Fit index and LRT false positive rate (of 500 samples) for all models and normal data (skew and kurtosis = 0).
| N | Distribution | AIC | BIC | SBIC | VLMR-LRT | LMR-LRT | BLRT |
|---|---|---|---|---|---|---|---|
| 50 | Normal | 0.81 | 0.35 | 0.99 | 0.13 | 0.11 | |
| | Skew Normal | 0.08 | 0.53 | 0.12 | 0.10 | | |
| | Skew t | 0.06 | | | | | |
| | t | 0.09 | 0.83 | | | | |
| 200 | Normal | 0.73 | 0.09 | 0.67 | 0.11 | 0.10 | |
| | Skew Normal | 0.12 | 0.12 | | | | |
| | Skew t | 0.11 | 0.11 | | | | |
| 800 | Normal | 0.72 | 0.27 | 0.17 | 0.15 | | |
| | Skew Normal | 0.09 | 0.08 | | | | |
| | Skew t | | | | | | |
| 3,200 | Normal | 0.81 | 0.08 | 0.23 | 0.22 | | |
| | Skew Normal | | | | | | |
| | Skew t | | | | | | |
| | t | 0.06 | | | | | |
AIC = Akaike’s information criterion; BIC = Bayesian information criterion; SBIC = sample-corrected BIC; VLMR-LRT = Vuong-Lo-Mendell-Rubin LRT; LMR-LRT = Lo-Mendell-Rubin adjusted LRT; and BLRT = bootstrap LRT. Type I error rates ≤ 0.05 appear bolded.
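The three information criteria are all penalized versions of the model deviance, and they differ only in how heavily they penalize additional parameters. The sketch below illustrates this with hypothetical log-likelihoods and parameter counts (none of these numbers come from the study). It assumes that SBIC refers to the sample-size adjusted BIC, which replaces N with (N + 2) / 24 in the penalty term; the function name is illustrative.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC and sample-size adjusted BIC for one fitted model."""
    deviance = -2.0 * log_likelihood
    aic = deviance + 2.0 * n_params
    bic = deviance + n_params * math.log(n_obs)
    # Assumption: SBIC uses the adjusted sample size (N + 2) / 24.
    sbic = deviance + n_params * math.log((n_obs + 2) / 24.0)
    return {"AIC": aic, "BIC": bic, "SBIC": sbic}


# Hypothetical fits of a 1-class and a 2-class model to the same N = 200 sample.
one_class = information_criteria(log_likelihood=-2050.0, n_params=9, n_obs=200)
two_class = information_criteria(log_likelihood=-2043.0, n_params=12, n_obs=200)

for name in ("AIC", "BIC", "SBIC"):
    preferred = 2 if two_class[name] < one_class[name] else 1
    print(f"{name}: 1 class = {one_class[name]:.2f}, "
          f"2 classes = {two_class[name]:.2f}, preferred classes = {preferred}")
```

In this made-up case the small gain in log-likelihood is enough for AIC and SBIC to prefer two classes, while the heavier BIC penalty keeps the one-class model. This is one way an index with a lighter penalty can show a higher false positive rate.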
Fit index false positive rate (of 500 samples) for all models and nonnormal data (skew = 1.6 and kurtosis = 4).
| N | Distribution | AIC | BIC | SBIC | VLMR-LRT | LMR-LRT | BLRT |
|---|---|---|---|---|---|---|---|
| 50 | Normal | 0.81 | 0.32 | 0.99 | 0.16 | 0.14 | |
| | Skew Normal | 0.09 | 0.60 | 0.09 | 0.09 | | |
| | Skew t | 0.06 | | | | | |
| | t | 0.27 | 0.86 | 0.18 | 0.18 | | |
| 200 | Normal | 0.83 | 0.15 | 0.79 | 0.22 | 0.21 | |
| | Skew Normal | 0.23 | 0.20 | 0.15 | 0.15 | | |
| | Skew t | 0.10 | 0.10 | | | | |
| | t | 0.49 | 0.44 | 0.31 | 0.30 | | |
| 800 | Normal | 1.00 | 0.41 | 0.88 | 0.63 | 0.62 | 0.41 |
| | Skew Normal | 0.57 | 0.25 | 0.36 | 0.36 | | |
| | Skew t | | | | | | |
| | t | 0.99 | 0.17 | 0.80 | 0.64 | 0.63 | |
| 3,200 | Normal | 1.00 | 1.00 | 1.00 | 0.92 | 0.91 | 1.00 |
| | Skew Normal | 0.99 | 0.47 | 0.82 | 0.75 | 0.74 | |
| | Skew t | | | | | | |
| | t | 1.00 | 0.99 | 1.00 | 0.87 | 0.87 | |
AIC = Akaike’s information criterion; BIC = Bayesian information criterion; SBIC = sample-corrected BIC; VLMR-LRT = Vuong-Lo-Mendell-Rubin LRT; LMR-LRT = Lo-Mendell-Rubin adjusted LRT; and BLRT = bootstrap LRT. Type I error rates ≤ 0.05 appear bolded.
Fig 3. Line plots for BIC false positive rate across skew-t family GMM for each distribution condition.
BIC of one-class versus two-class models (of 500 samples) for all models by sample size and distributional conditions.
| N | Distribution | Skew 0, kurtosis 0 | | Skew 1, kurtosis 2 | | Skew 1.6, kurtosis 4 | |
|---|---|---|---|---|---|---|---|
| | | 1 class | 2 classes | 1 class | 2 classes | 1 class | 2 classes |
| 50 | Normal | 1047.53 | 1041.20 | 1038.70 | | | |
| | Skew Normal | 1073.66 | 1065.84 | 1065.08 | | | |
| | Skew t | 1085.08 | 1070.92 | 1065.16 | | | |
| | t | 1070.90 | 1050.77 | 1044.12 | | | |
| 200 | Normal | 4117.22 | 4110.34 | 4106.24 | | | |
| | Skew Normal | 4143.18 | 4128.64 | 4128.73 | | | |
| | Skew t | 4163.27 | 4111.92 | 4085.75 | | | |
| | t | 4136.17 | 4079.85 | 4058.50 | | | |
| 800 | Normal | 16328.13 | 16304.96 | 16292.09 | | | |
| | Skew Normal | 16362.82 | 16326.21 | 16304.20 | | | |
| | Skew t | 16381.07 | 16186.08 | 16088.54 | | | |
| | t | 16344.86 | 16138.73 | 16039.22 | | | |
| 3,200 | Normal | 65091.01 | 65064.35 | 65063.69 | | | |
| | Skew Normal | 65142.69 | 65057.66 | 65012.48 | | | |
| | Skew t | 65157.76 | 64407.58 | 64018.50 | | | |
| | t | 65117.83 | 64352.65 | 63990.71 | | | |
BIC = Bayesian information criterion. The lowest BIC values within the same distribution appear bolded.
a: Lowest BIC values across classes and distributions.
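A false positive rate based on the BIC can be read as the share of replications in which the two-class model attains a lower BIC than the one-class model, even though the data were generated from a single class. The sketch below illustrates that calculation under this assumed decision rule. The function name and the five BIC pairs are hypothetical and are not taken from the simulation output.

```python
def false_positive_rate(bic_one_class, bic_two_class):
    """Share of replications in which the 2-class model has the lower BIC.

    The data are assumed to come from a single class, so every preference
    for two classes counts as a false positive.
    """
    if len(bic_one_class) != len(bic_two_class):
        raise ValueError("need one BIC pair per replication")
    wrong = sum(1 for b1, b2 in zip(bic_one_class, bic_two_class) if b2 < b1)
    return wrong / len(bic_one_class)


# Hypothetical BIC pairs for five replications (illustrative numbers only).
bic_1 = [1047.5, 1050.1, 1043.9, 1049.0, 1046.2]
bic_2 = [1052.3, 1048.7, 1049.5, 1055.4, 1051.0]

rate = false_positive_rate(bic_1, bic_2)
print(f"false positive rate: {rate:.2f}")
```

With 500 replications per condition, the same calculation would yield rates on the scale reported in the fit index tables, where values at or below 0.05 are treated as acceptable.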
Fig 4. Line plots for LMR-adjusted LRT false positive rate across skew-t family GMM for each distribution condition.
Number of classes suggested by the BIC for each member of the skew t GMM family for the ECLS-K database.
| Number of classes | Normal | Skew Normal | Skew t | t |
|---|---|---|---|---|
| 1 | 107856.48 | 107598.56 | 107312.89 | 107306.81 |
| 2 | 105566.12 | 104717.86 | 104699.88 | 105555.77 |
| 3 | 104831.19 | 104233.29 | 104183.99 | 104854.35 |
| 4 | 104268.16 | 103848.48 | 104285.40 | |
| 5 | 103793.69 | 104100.77 | 104151.61 | |
| 6 | 104323.23 | |||
| 7 | 104028.70 | 104200.93 |
BIC = Bayesian information criterion. The lowest BIC values within the same distribution appear bolded.
a: Lowest BIC values across classes and distributions.
Fit index false positive rate (of 500 samples) for all models and slightly nonnormal data (skew = 1 and kurtosis = 2).
| N | Distribution | AIC | BIC | SBIC | VLMR-LRT | LMR-LRT | BLRT |
|---|---|---|---|---|---|---|---|
| 50 | Normal | 0.77 | 0.68 | 0.99 | 0.18 | 0.16 | |
| | Skew Normal | 0.08 | 0.58 | 0.10 | 0.10 | | |
| | Skew t | | | | | | |
| | t | 0.29 | 0.78 | 0.16 | 0.16 | | |
| 200 | Normal | 0.76 | 0.09 | 0.71 | 0.17 | 0.17 | |
| | Skew Normal | 0.15 | 0.12 | 0.15 | 0.14 | | |
| | Skew t | 0.06 | 0.06 | | | | |
| | t | 0.36 | 0.34 | 0.27 | 0.24 | | |
| 800 | Normal | 0.95 | 0.15 | 0.69 | 0.46 | 0.43 | 0.14 |
| | Skew Normal | 0.45 | 0.15 | 0.35 | 0.34 | | |
| | Skew t | | | | | | |
| | t | 0.87 | 0.46 | 0.48 | 0.48 | | |
| 3,200 | Normal | 1.00 | 0.86 | 0.99 | 0.87 | 0.87 | 0.94 |
| | Skew Normal | 0.96 | 0.27 | 0.67 | 0.75 | 0.74 | |
| | Skew t | | | | | | |
| | t | 1.00 | 0.99 | 1.00 | 0.87 | 0.87 | |
AIC = Akaike’s information criterion; BIC = Bayesian information criterion; SBIC = sample-corrected BIC; VLMR-LRT = Vuong-Lo-Mendell-Rubin LRT; LMR-LRT = Lo-Mendell-Rubin adjusted LRT; and BLRT = bootstrap LRT. Type I error rates ≤ 0.05 appear bolded.