Eleni Matechou, Ivy Liu, Daniel Fernández, Miguel Farias, Bergljot Gjelsvik.
Abstract
This paper introduces finite mixture models that simultaneously cluster the rows and columns of two-mode ordinal categorical response data, such as Likert scale responses. We use the popular proportional odds parameterisation and propose models that provide insights into major patterns in the data. Models are fitted using the EM algorithm, which yields a fuzzy allocation of rows and columns to their corresponding clusters. The clustering ability of the models is evaluated in a simulation study and demonstrated using two real data sets.
Keywords: EM algorithm; Likert scale; fuzzy clustering; proportional odds
Year: 2016 PMID: 27329648 PMCID: PMC4978779 DOI: 10.1007/s11336-016-9503-3
Source DB: PubMed Journal: Psychometrika ISSN: 0033-3123 Impact factor: 2.500
Model set with corresponding number of parameters
The following constraints are placed, where appropriate: . : a single row cluster, : r row clusters, : each row is in its own cluster. Similarly, : a single column cluster, : c column clusters and : each column is in its own cluster. For example, when , , the rows form one cluster, while the columns form c clusters and the logits of the cumulative probabilities in the PO model for column cluster c and are , for all rows. If on the other hand , , the cumulative probabilities for row i, column cluster c are, assuming an interaction between row and column effects and ,
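As a rough illustration of how the logits of cumulative probabilities in a proportional odds (PO) model translate into category probabilities, the sketch below assumes a generic form, logit P(Y ≤ k) = mu_k − eta, where eta stands for the combined row-cluster, column-cluster and interaction effects. The symbol names, the sign convention and the numeric values are assumptions made for this example and are not recovered from the text above.

```python
import numpy as np

def po_category_probs(mu, eta):
    """Category probabilities under a generic proportional odds model.

    mu  : increasing cut-points mu_1 < ... < mu_{q-1} (hypothetical values)
    eta : linear predictor for one (row cluster, column cluster) pair;
          assumed here to collect row, column and interaction effects
    """
    mu = np.asarray(mu, dtype=float)
    # Assumed convention: logit P(Y <= k) = mu_k - eta, for k = 1, ..., q-1
    cumulative = 1.0 / (1.0 + np.exp(-(mu - eta)))
    # Pad with P(Y <= 0) = 0 and P(Y <= q) = 1, then take differences
    cumulative = np.concatenate(([0.0], cumulative, [1.0]))
    return np.diff(cumulative)

# Five ordinal categories (four cut-points), illustrative numbers only
probs = po_category_probs(mu=[-1.0, 0.0, 1.0, 2.0], eta=0.5)
print(probs, probs.sum())
```

Because the same cut-points are shared across clusters, changing `eta` shifts the whole response distribution towards higher or lower categories without changing its ordering, which is what makes the PO parameterisation convenient for comparing clusters.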
Information criteria summary table.
| Criteria | Definition | Proposed for | Depending on |
|---|---|---|---|
| AIC (Akaike) | | Regression | |
| CAIC (Bozdogan) | | Regression | |
| BIC (Schwarz) | | Regression | |
| AIC3 (Bozdogan) | | Clustering | |
| CLC (Biernacki & Govaert) | | Clustering | |
| NEC(R) (Biernacki, Celeux, & Govaert) | | Clustering | |
| ICL-BIC (Biernacki et al.) | | Clustering | |
| AWE (Banfield & Raftery) | | Clustering | |
The definitions involve three quantities: the maximised incomplete-data log-likelihood (see Eq. 5); the maximised incomplete-data log-likelihood without clustering structure; and the maximised complete-data log-likelihood given in Appendix A. The third column categorises the criteria according to whether they were proposed for model selection in a regression setting or for clustering. The last column indicates whether the penalty depends on the number of parameters, on the total sample size np (the number of elements in the response matrix Y), and/or on the entropy function.
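To make the role of the penalty terms concrete, here is a minimal sketch that computes several of the listed criteria from a maximised log-likelihood. It uses the commonly quoted textbook forms (for example, AIC = −2ℓ + 2K and BIC = −2ℓ + K log N); the exact definitions in the table above were not recoverable, so the paper's versions may differ. The function name, argument names and input values are hypothetical.

```python
import math

def information_criteria(loglik, n_params, sample_size, entropy=None):
    """Commonly quoted forms of some model selection criteria.

    loglik      : maximised incomplete-data log-likelihood
    n_params    : number of free parameters K
    sample_size : total sample size N (here, the number of cells n * p)
    entropy     : optional entropy of the fuzzy allocation, needed for
                  the entropy-based criteria
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    minus2ll = -2.0 * loglik
    out = {
        "AIC": minus2ll + 2 * n_params,
        "AIC3": minus2ll + 3 * n_params,
        "BIC": minus2ll + n_params * math.log(sample_size),
        "CAIC": minus2ll + n_params * (math.log(sample_size) + 1),
    }
    if entropy is not None:
        # Entropy-penalised criteria, in their usual textbook forms
        out["CLC"] = minus2ll + 2 * entropy
        out["ICL-BIC"] = out["BIC"] + 2 * entropy
    return out

# Illustrative numbers only; smaller values indicate a preferred model
crit = information_criteria(loglik=-100.0, n_params=5, sample_size=50, entropy=3.0)
for name, value in crit.items():
    print(f"{name}: {value:.2f}")
```

In practice the function would be evaluated once per candidate model, and the model with the smallest value of the chosen criterion would be selected.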
Fig. 1 Simulation study to assess the performance of model selection criteria in recovering the true number of clusters for our proposed biclustering finite mixture PO (POFM) model. Bars depict the percentage of cases in which the true model is correctly identified by each criterion, averaged across the five scenarios.
The average estimate obtained for each parameter over 100 simulations.
| | | True | (0.5, 0.5) | | (0.4, 0.6) | | (0.3, 0.7) | | (0.2, 0.8) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | 3 | 5 | 3 | 5 | 3 | 5 | 3 | 5 |
| 9 | 10 | | 1.40 | 1.46 | 1.43 | 1.58 | 1.43 | 1.56 | 1.46 | 1.49 |
| | 10 | | 3.03 | 1.99 | 2.30 | 2.22 | 2.40 | 1.95 | 2.37 | 1.99 |
| | 10 | | 1.33 | 1.02 | 0.98 | 0.90 | 0.76 | 0.86 | 0.73 | 0.71 |
| | 20 | | 1.42 | 1.38 | 1.42 | 1.43 | 1.41 | 1.38 | 1.45 | 1.40 |
| | 20 | | 1.88 | 1.91 | 1.95 | 1.90 | 2.07 | 1.84 | 2.00 | 1.92 |
| | 20 | | 0.95 | 0.91 | 1.43 | 0.84 | 1.14 | 0.93 | 0.71 | 0.69 |
| | 100 | | 1.31 | 1.42 | 1.34 | 1.43 | 1.38 | 1.44 | 1.37 | 1.44 |
| | 100 | | 1.88 | 1.97 | 1.90 | 2.00 | 1.92 | 1.99 | 1.92 | 2.00 |
| | 100 | | 1.07 | 0.88 | 0.93 | 0.81 | 1.24 | 1.02 | 0.98 | 0.88 |
| 30 | 10 | | 1.41 | 1.44 | 1.43 | 1.37 | 1.38 | 1.45 | 1.40 | 1.38 |
| | 10 | | 2.47 | 2.23 | 2.70 | 2.30 | 2.54 | 2.09 | 2.90 | 1.94 |
| | 10 | | 1.01 | 0.96 | 1.07 | 0.93 | 0.96 | 0.92 | 0.94 | 0.78 |
| | 20 | | 1.26 | 1.18 | 1.15 | 1.19 | 1.19 | 1.22 | 1.19 | 1.23 |
| | 20 | | 1.96 | 1.98 | 2.02 | 2.05 | 2.06 | 1.96 | 2.08 | 2.04 |
| | 20 | | 0.95 | 0.96 | 1.02 | 1.00 | 1.02 | 1.02 | 0.91 | 1.00 |
| | 100 | | 1.11 | 1.30 | 1.16 | 1.34 | 1.16 | 1.34 | 1.17 | 1.32 |
| | 100 | | 1.96 | 1.98 | 1.92 | 1.98 | 1.93 | 1.99 | 1.95 | 1.99 |
| | 100 | | 0.97 | 0.95 | 0.96 | 0.95 | 0.98 | 0.97 | 0.97 | 0.96 |
| 99 | 10 | | 1.22 | 1.24 | 1.42 | 1.31 | 1.22 | 1.22 | 1.39 | 1.19 |
| | 10 | | 2.28 | 2.16 | 2.32 | 2.22 | 2.33 | 2.21 | 2.47 | 2.16 |
| | 10 | | 1.00 | 0.97 | 1.01 | 0.99 | 1.01 | 1.00 | 0.96 | 0.98 |
| | 20 | | 1.05 | 1.02 | 1.03 | 1.03 | 1.06 | 1.01 | 1.06 | 1.06 |
| | 20 | | 2.04 | 1.99 | 2.04 | 2.04 | 2.05 | 1.97 | 2.06 | 2.01 |
| | 20 | | 1.01 | 0.99 | 1.00 | 1.00 | 0.98 | 0.99 | 0.99 | 0.98 |
| | 100 | | 1.03 | 1.13 | 1.04 | 1.14 | 1.05 | 1.19 | 1.04 | 1.17 |
| | 100 | | 1.99 | 1.99 | 1.99 | 2.00 | 1.97 | 2.00 | 1.99 | 1.99 |
| | 100 | | 0.99 | 1.00 | 1.00 | 0.99 | 1.00 | 0.99 | 1.00 | 0.99 |
The average Rand index for 100 simulated data sets based on our proposed (POFM) and double k-means (dkm) methods.
| | | | (0.5, 0.5) | | (0.2, 0.8) | | (0.5, 0.5) | | (0.2, 0.8) | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | Cluster | POFM | dkm | POFM | dkm | POFM | dkm | POFM | dkm |
| 9 | 10 | Row | 0.61 | 0.75 | 0.63 | 0.72 | 0.65 | 0.76 | 0.64 | 0.74 |
| | 10 | Col. | 0.64 | 0.63 | 0.60 | 0.54 | 0.65 | 0.59 | 0.59 | 0.52 |
| | 20 | Row | 0.74 | 0.78 | 0.73 | 0.80 | 0.75 | 0.76 | 0.71 | 0.78 |
| | 20 | Col. | 0.64 | 0.59 | 0.60 | 0.55 | 0.65 | 0.60 | 0.61 | 0.53 |
| | 100 | Row | 0.81 | 0.99 | 0.79 | 0.97 | 0.77 | 0.98 | 0.76 | 0.97 |
| | 100 | Col. | 0.66 | 0.62 | 0.62 | 0.55 | 0.70 | 0.64 | 0.65 | 0.57 |
| 30 | 10 | Row | 0.65 | 0.70 | 0.66 | 0.70 | 0.66 | 0.70 | 0.67 | 0.72 |
| | 10 | Col. | 0.75 | 0.76 | 0.80 | 0.60 | 0.86 | 0.75 | 0.73 | 0.67 |
| | 20 | Row | 0.76 | 0.77 | 0.78 | 0.78 | 0.78 | 0.79 | 0.78 | 0.80 |
| | 20 | Col. | 0.90 | 0.80 | 0.86 | 0.65 | 0.91 | 0.83 | 0.86 | 0.71 |
| | 100 | Row | 0.92 | 0.99 | 0.91 | 0.99 | 0.85 | 0.99 | 0.84 | 0.99 |
| | 100 | Col. | 0.91 | 0.84 | 0.93 | 0.74 | 0.93 | 0.87 | 0.94 | 0.79 |
| 99 | 10 | Row | 0.68 | 0.70 | 0.68 | 0.71 | 0.69 | 0.71 | 0.69 | 0.71 |
| | 10 | Col. | 0.99 | 0.96 | 0.95 | 0.85 | 0.99 | 0.97 | 0.93 | 0.88 |
| | 20 | Row | 0.78 | 0.80 | 0.80 | 0.81 | 0.82 | 0.81 | 0.81 | 0.82 |
| | 20 | Col. | 0.99 | 0.99 | 0.99 | 0.97 | 1.00 | 0.99 | 0.97 | 0.98 |
| | 100 | Row | 0.98 | 0.99 | 0.97 | 0.99 | 0.92 | 0.99 | 0.91 | 0.99 |
| | 100 | Col. | 0.99 | 0.99 | 1.00 | 0.98 | 1.00 | 0.99 | 1.00 | 0.99 |
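The Rand index reported in this table measures agreement between two partitions as the fraction of item pairs on which they agree, that is, pairs placed together in both partitions or apart in both. The following is a minimal sketch of that calculation; it assumes hard cluster labels (for a fuzzy allocation, one option is to assign each row or column to its highest-probability cluster first), and the label vectors are made up for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")
    agreements = 0
    n_pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees when both partitions group it together,
        # or both partitions keep it apart
        agreements += same_a == same_b
        n_pairs += 1
    return agreements / n_pairs

# Hypothetical true and estimated cluster labels for six rows
true_labels = [0, 0, 1, 1, 2, 2]
estimated_labels = [1, 1, 0, 0, 0, 2]
ri = rand_index(true_labels, estimated_labels)
print(ri)
```

The index depends only on co-membership, so it is unaffected by how the clusters happen to be numbered in either partition.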
The average Rand index based on our proposed (POFM) and double k-means (dkm) methods for 1000 simulated data sets.
| | | Method | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | 3 | 5 | 7 | 3 | 5 | 7 | 3 | 5 | 7 |
| 9 | 10 | POFM | 0.61 | 0.63 | 0.64 | 0.73 | 0.78 | 0.80 | 0.74 | 0.75 | 0.75 |
| | | dkm | 0.68 | 0.69 | 0.69 | 0.70 | 0.72 | 0.73 | 0.72 | 0.74 | 0.75 |
| | 20 | POFM | 0.70 | 0.72 | 0.73 | 0.79 | 0.86 | 0.88 | 0.77 | 0.76 | 0.75 |
| | | dkm | 0.70 | 0.71 | 0.72 | 0.71 | 0.73 | 0.74 | 0.74 | 0.77 | 0.78 |
| | 100 | POFM | 0.85 | 0.84 | 0.83 | 0.94 | 0.94 | 0.86 | 0.75 | 0.75 | 0.75 |
| | | dkm | 0.74 | 0.77 | 0.78 | 0.74 | 0.77 | 0.78 | 0.79 | 0.88 | 0.90 |
| 30 | 10 | POFM | 0.65 | 0.67 | 0.68 | 0.75 | 0.81 | 0.84 | 0.76 | 0.77 | 0.77 |
| | | dkm | 0.66 | 0.67 | 0.68 | 0.70 | 0.72 | 0.73 | 0.71 | 0.74 | 0.76 |
| | 20 | POFM | 0.73 | 0.76 | 0.77 | 0.84 | 0.93 | 0.95 | 0.78 | 0.78 | 0.78 |
| | | dkm | 0.70 | 0.72 | 0.72 | 0.72 | 0.75 | 0.76 | 0.75 | 0.80 | 0.81 |
| | 100 | POFM | 0.94 | 0.92 | 0.91 | 0.95 | 0.99 | 0.92 | 0.77 | 0.77 | 0.77 |
| | | dkm | 0.79 | 0.83 | 0.86 | 0.76 | 0.84 | 0.87 | 0.93 | 0.97 | 0.98 |
| 99 | 10 | POFM | 0.67 | 0.68 | 0.69 | 0.76 | 0.84 | 0.88 | 0.76 | 0.77 | 0.78 |
| | | dkm | 0.67 | 0.68 | 0.68 | 0.70 | 0.72 | 0.73 | 0.72 | 0.75 | 0.76 |
| | 20 | POFM | 0.75 | 0.78 | 0.80 | 0.86 | 0.95 | 0.97 | 0.79 | 0.78 | 0.78 |
| | | dkm | 0.71 | 0.73 | 0.74 | 0.73 | 0.77 | 0.80 | 0.79 | 0.85 | 0.86 |
| | 100 | POFM | 0.98 | 0.97 | 0.96 | 0.97 | 1.00 | 0.97 | 0.78 | 0.78 | 0.78 |
| | | dkm | 0.88 | 0.92 | 0.93 | 0.82 | 0.87 | 0.89 | 0.99 | 0.99 | 0.99 |
Fig. 2 Estimated probabilities of replying 3 or above, for each combination of the 3 row clusters and 2 column clusters, as derived by the corresponding biclustering model.
Percent of individuals from the five POFM clusters, represented in the rows, that are clustered in the corresponding five double k-means (Vichi, 2001) clusters.
| POFM cluster | Double k-means cluster | | | | |
|---|---|---|---|---|---|
| | 1 | 2 | 3 | 4 | 5 |
| 1 | 100 | 0 | 0 | 0 | 0 |
| 2 | 26 | 72 | 2 | 0 | 0 |
| 3 | 0 | 10 | 48 | 23 | 0 |
| 4 | 0 | 0 | 0 | 0 | 100 |
| 5 | 0 | 0 | 21 | 30 | 49 |
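A cross-classification of this kind can be produced by counting how the members of each cluster from one method are distributed over the clusters of the other, and then converting each row to percentages. The sketch below shows one way to do this; the function name and the label vectors are hypothetical and are not taken from the data analysed above.

```python
import numpy as np

def row_percentages(labels_a, labels_b, n_clusters_a, n_clusters_b):
    """Percentage of each cluster in labels_a falling in each cluster of labels_b."""
    counts = np.zeros((n_clusters_a, n_clusters_b))
    for a, b in zip(labels_a, labels_b):
        counts[a, b] += 1
    totals = counts.sum(axis=1, keepdims=True)
    # Rows for empty clusters are left at zero rather than dividing by zero
    shares = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
    return 100.0 * shares

# Hypothetical hard assignments of seven individuals by two methods
method_a = [0, 0, 0, 1, 1, 1, 1]
method_b = [0, 0, 1, 1, 1, 1, 0]
table = row_percentages(method_a, method_b, n_clusters_a=2, n_clusters_b=2)
print(np.round(table, 1))
```

Each non-empty row of the result sums to 100, so a row concentrated in a single column indicates that the two methods recover essentially the same cluster.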