Tin Lok James Ng, Thomas Brendan Murphy.
Abstract
A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher-order interactions among objects in real-world problems. This model is an extension of the latent class analysis model that introduces two clustering structures for hyperedges and captures variation in the size of hyperedges. An expectation maximization algorithm with minorization maximization steps is developed to perform parameter estimation. Model selection using the Bayesian Information Criterion is proposed. The model is applied to simulated data and two real-world data sets where interesting results are obtained.
Keywords: Clustering; Hypergraph; Latent class analysis; Minorization maximization
Year: 2021 PMID: 36043219 PMCID: PMC9418112 DOI: 10.1007/s11634-021-00454-7
Source DB: PubMed Journal: Adv Data Anal Classif ISSN: 1862-5355
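The abstract describes the model as an extension of latent class analysis (LCA) fitted by expectation maximization. As a point of reference only, the following is a minimal sketch of EM for the plain LCA model (a mixture of independent Bernoulli variables), not the extended model or the minorization maximization steps of the paper. The function names `fit_lca` and `_log_joint`, the initialisation scheme and the toy data are assumptions made for illustration.

```python
import numpy as np


def _log_joint(X, pi, phi):
    # log pi_g + sum_j [x_ij log phi_gj + (1 - x_ij) log(1 - phi_gj)]
    return np.log(pi)[None, :] + X @ np.log(phi).T + (1.0 - X) @ np.log1p(-phi).T


def fit_lca(X, G, n_iter=200, seed=0, eps=1e-8):
    """EM for plain LCA. X is an (M, N) binary matrix: row i is a hyperedge,
    X[i, j] = 1 if node j belongs to it. G is the number of latent classes."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    M, N = X.shape
    pi = np.full(G, 1.0 / G)                      # mixing proportions
    phi = rng.uniform(0.25, 0.75, size=(G, N))    # node inclusion probabilities
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each hyperedge.
        lj = _log_joint(X, pi, phi)
        z = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))
        # M-step: closed-form updates (eps guards against empty classes).
        nk = z.sum(axis=0) + eps
        pi = nk / nk.sum()
        phi = np.clip((z.T @ X) / nk[:, None], eps, 1.0 - eps)
    # Posterior memberships and log-likelihood at the final parameters.
    lj = _log_joint(X, pi, phi)
    log_norm = np.logaddexp.reduce(lj, axis=1, keepdims=True)
    return pi, phi, np.exp(lj - log_norm), float(log_norm.sum())


# Toy data (hypothetical): six hyperedges over four nodes.
X_toy = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
                  [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 1]])
pi_hat, phi_hat, z_hat, loglik = fit_lca(X_toy, G=2)
print(pi_hat.round(2), loglik)
```

Working in log space with `np.logaddexp.reduce` keeps the E-step numerically stable when a hyperedge has very low probability under every class.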
Fig. 1 A hypergraph representation of a coauthorship network
Fig. 2 Bipartite graph representation of the hypergraph in Fig. 1
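The two representations named in these captions can be sketched in a few lines. The example below uses a hypothetical coauthorship network (not the one drawn in Fig. 1): each paper is a hyperedge over authors, the incidence matrix has one row per hyperedge, and the bipartite edge list links each hyperedge to its member nodes. The helper names `incidence_matrix` and `bipartite_edges` are illustrative assumptions.

```python
import numpy as np


def incidence_matrix(hyperedges, nodes):
    """Binary matrix with one row per hyperedge and one column per node."""
    index = {v: j for j, v in enumerate(nodes)}
    X = np.zeros((len(hyperedges), len(nodes)), dtype=int)
    for i, edge in enumerate(hyperedges):
        for v in edge:
            X[i, index[v]] = 1
    return X


def bipartite_edges(hyperedges):
    """Edges of the bipartite graph: (hyperedge label, node) pairs."""
    return [(f"e{i + 1}", v)
            for i, edge in enumerate(hyperedges)
            for v in sorted(edge)]


# Hypothetical authors and papers; a single-author paper is a unary interaction.
authors = ["A", "B", "C", "D"]
papers = [{"A"}, {"A", "B"}, {"B", "C", "D"}]

X_inc = incidence_matrix(papers, authors)
edges = bipartite_edges(papers)
print(X_inc)
print(edges)
```

Row sums of the incidence matrix give the hyperedge sizes, which is the quantity whose variation the model is said to capture.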
Convergence analysis of the EM algorithm for the ELCA model with 2 primary clusters and 2 additional clusters
| Model | | | | | | | |
|---|---|---|---|---|---|---|---|
| 10-node ( | 100 | 0.0465 | 0.0224 | 0.0269 | 0.0630 | 0.0412 | 0.1561 |
| | 500 | 0.0205 | 0.0075 | 0.0083 | 0.0315 | 0.0374 | 0.1463 |
| | 1000 | 0.0124 | 0.0043 | 0.0064 | 0.0199 | 0.0379 | 0.1428 |
| 10-node ( | 100 | 0.0549 | 0.0292 | 0.0147 | 0.0491 | 0.0293 | 0.1450 |
| | 500 | 0.0248 | 0.0108 | 0.0082 | 0.0296 | 0.0266 | 0.1454 |
| | 1000 | 0.0209 | 0.0046 | 0.0039 | 0.0199 | 0.0273 | 0.1453 |
| 10-node ( | 100 | 0.0546 | 0.0176 | 0.0173 | 0.0435 | 0.0380 | 0.1332 |
| | 500 | 0.0257 | 0.0053 | 0.0106 | 0.0220 | 0.0374 | 0.1328 |
| | 1000 | 0.0146 | 0.0027 | 0.0053 | 0.0173 | 0.0362 | 0.1312 |
| 10-node ( | 100 | 0.0698 | 0.0137 | 0.0213 | 0.0441 | 0.0365 | 0.1430 |
| | 500 | 0.0247 | 0.0082 | 0.0094 | 0.0189 | 0.0372 | 0.1279 |
| | 1000 | 0.0168 | 0.0040 | 0.0082 | 0.0132 | 0.0358 | 0.1235 |
| 20-node ( | 100 | 0.0559 | 0.0120 | 0.0216 | 0.0195 | 0.0065 | 0.0750 |
| | 500 | 0.0170 | 0.0039 | 0.0102 | 0.0103 | 0.0059 | 0.0720 |
| | 1000 | 0.0114 | 0.0037 | 0.0051 | 0.0101 | 0.0056 | 0.0701 |
| 20-node ( | 100 | 0.0450 | 0.0127 | 0.0250 | 0.0301 | 0.0102 | 0.0640 |
| | 500 | 0.0232 | 0.0041 | 0.0080 | 0.0087 | 0.0061 | 0.0620 |
| | 1000 | 0.0112 | 0.0024 | 0.0054 | 0.0082 | 0.0062 | 0.0624 |
| 20-node ( | 100 | 0.0389 | 0.0120 | 0.0278 | 0.0309 | 0.0090 | 0.0635 |
| | 500 | 0.0242 | 0.0040 | 0.0081 | 0.0133 | 0.0089 | 0.0613 |
| | 1000 | 0.0135 | 0.0018 | 0.0077 | 0.0112 | 0.0086 | 0.0604 |
| 20-node ( | 100 | 0.0558 | 0.0100 | 0.0172 | 0.0304 | 0.0082 | 0.0724 |
| | 500 | 0.0194 | 0.0039 | 0.0139 | 0.0121 | 0.0068 | 0.0686 |
| | 1000 | 0.0108 | 0.0021 | 0.0071 | 0.0061 | 0.0067 | 0.0627 |
The distances between the true and estimated parameters, and the misclassification rates for both the primary and additional clusters, are presented
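A misclassification rate for a clustering has to allow for arbitrary cluster labels. One common way to compute it, sketched here under the assumption that the rate is taken as the minimum over all relabellings of the estimated clusters (the source does not state how label switching was handled), is the following. The function name `misclassification_rate` and the toy labels are hypothetical.

```python
from itertools import permutations

import numpy as np


def misclassification_rate(true_labels, est_labels, n_clusters):
    """Smallest fraction of mismatches over all permutations of the
    estimated cluster labels (labels are integers 0..n_clusters-1)."""
    true_labels = np.asarray(true_labels)
    est_labels = np.asarray(est_labels)
    best = 1.0
    for perm in permutations(range(n_clusters)):
        relabelled = np.array(perm)[est_labels]   # apply the relabelling
        best = min(best, float(np.mean(relabelled != true_labels)))
    return best


# Toy labels (hypothetical): clusters 0 and 1 are swapped, one hyperedge is wrong.
truth = [0, 0, 1, 1, 2, 2]
estimate = [1, 1, 0, 0, 2, 0]
rate = misclassification_rate(truth, estimate, n_clusters=3)
print(rate)
```

Enumerating permutations is only practical for a handful of clusters, which is the case for the 2 and 3 cluster settings reported in these tables.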
Convergence analysis of the EM algorithm for the ELCA model with 3 primary clusters and 2 additional clusters
| Model | | | | | | | |
|---|---|---|---|---|---|---|---|
| 10-node ( | 100 | 0.1286 | 0.0399 | 0.0235 | 0.0778 | 0.1997 | 0.1858 |
| | 500 | 0.0747 | 0.0076 | 0.0108 | 0.0352 | 0.1758 | 0.1692 |
| | 1000 | 0.0541 | 0.0069 | 0.0099 | 0.0138 | 0.1575 | 0.1553 |
| 10-node ( | 100 | 0.1317 | 0.0368 | 0.0589 | 0.0590 | 0.1715 | 0.1620 |
| | 500 | 0.0850 | 0.0117 | 0.0448 | 0.0363 | 0.1582 | 0.1573 |
| | 1000 | 0.0534 | 0.0052 | 0.0216 | 0.0173 | 0.1529 | 0.1542 |
| 10-node ( | 100 | 0.1329 | 0.0432 | 0.0277 | 0.0522 | 0.2335 | 0.1375 |
| | 500 | 0.1053 | 0.0106 | 0.0126 | 0.0160 | 0.2172 | 0.1318 |
| | 1000 | 0.0698 | 0.0063 | 0.0112 | 0.0171 | 0.2038 | 0.1291 |
| 10-node ( | 100 | 0.1318 | 0.0390 | 0.0782 | 0.0319 | 0.2162 | 0.1485 |
| | 500 | 0.0866 | 0.0091 | 0.0521 | 0.0162 | 0.1941 | 0.1292 |
| | 1000 | 0.0745 | 0.0052 | 0.0368 | 0.0158 | 0.1877 | 0.1241 |
| 20-node ( | 100 | 0.1083 | 0.0194 | 0.0208 | 0.0390 | 0.1655 | 0.1105 |
| | 500 | 0.0523 | 0.0039 | 0.0058 | 0.0139 | 0.1293 | 0.1045 |
| | 1000 | 0.0356 | 0.0019 | 0.0028 | 0.0069 | 0.1208 | 0.1014 |
| 20-node ( | 100 | 0.1217 | 0.0169 | 0.0597 | 0.0398 | 0.1647 | 0.1020 |
| | 500 | 0.0618 | 0.0062 | 0.0271 | 0.0182 | 0.1176 | 0.0992 |
| | 1000 | 0.0339 | 0.0027 | 0.0139 | 0.0078 | 0.1094 | 0.0967 |
| 20-node ( | 100 | 0.1079 | 0.0205 | 0.0290 | 0.0389 | 0.2275 | 0.0915 |
| | 500 | 0.0672 | 0.0083 | 0.0104 | 0.0229 | 0.1728 | 0.0862 |
| | 1000 | 0.0434 | 0.0041 | 0.0038 | 0.0131 | 0.1574 | 0.0807 |
| 20-node ( | 100 | 0.1265 | 0.0604 | 0.0703 | 0.0389 | 0.1982 | 0.0880 |
| | 500 | 0.0724 | 0.0192 | 0.0384 | 0.0207 | 0.1617 | 0.0752 |
| | 1000 | 0.0366 | 0.0025 | 0.0121 | 0.0119 | 0.1426 | 0.0713 |
The distances between the true and estimated parameters, and the misclassification rates for both the primary and additional clusters, are presented
Percentage of times the lowest BIC values occurred in each model
| G | K | BIC | BIC | BIC | BIC | BIC | BIC | BIC | BIC | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 28 | 11 | 26 | 14 | 9 | 12 | 0 | 0 | |
| 42 | ||||||||||
| 3 | 1 | 4 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2 | 2 | 3 | 7 | 7 | 4 | 0 | 0 | 0 | 17 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
For the first two columns (Column ‘G’ and ‘K’): bold indicates the true model. For the rest of the columns, the largest values are bolded
Percentage of times the lowest BIC values occurred in each model
| G | K | BIC | BIC | BIC | BIC | BIC | BIC | BIC | BIC | BIC |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 12 | 2 | 27 | 4 | 0 | 13 | 0 | 0 | |
| 2 | 2 | 28 | 13 | 1 | 9 | 0 | 2 | 0 | ||
| 3 | 1 | 14 | 29 | 19 | 9 | 14 | 7 | 0 | 0 | 0 |
| 15 | 22 | 29 | ||||||||
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 |
| 4 | 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
For the first two columns (Column ‘G’ and ‘K’): bold indicates the true model. For the rest of the columns, the largest values are bolded
Proportion of times that the true model can be recovered
| True model | Fitted model | | | RR ( | RR ( |
|---|---|---|---|---|---|
| | | 10 | 50 | 0.55 | 0.83 |
| | | | 100 | 0.57 | 0.83 |
| | | | 500 | 0.66 | 0.91 |
| | | 20 | 50 | 0.64 | 0.85 |
| | | | 100 | 0.66 | 0.83 |
| | | | 500 | 0.72 | 0.88 |
| | | 10 | 50 | 0.32 | 0.51 |
| | | | 100 | 0.27 | 0.59 |
| | | | 500 | 0.33 | 0.62 |
| | | 20 | 50 | 0.56 | 0.78 |
| | | | 100 | 0.55 | 0.86 |
| | | | 500 | 0.59 | 0.83 |
| | | 10 | 50 | 0.55 | 0.77 |
| | | | 100 | 0.53 | 0.80 |
| | | | 500 | 0.52 | 0.78 |
| | | 20 | 50 | 0.70 | 0.93 |
| | | | 100 | 0.67 | 0.92 |
| | | | 500 | 0.66 | 0.89 |
| | | 10 | 50 | 0.43 | 0.65 |
| | | | 100 | 0.36 | 0.63 |
| | | | 500 | 0.33 | 0.64 |
| | | 20 | 50 | 0.54 | 0.71 |
| | | | 100 | 0.49 | 0.76 |
| | | | 500 | 0.53 | 0.75 |
The two recovery rates (RR) for each simulation setting are shown
Model selection for the Star Wars data set
| No. of clusters | No. of additional clusters | BIC |
|---|---|---|
| 1 | 1 | 1298.08 |
| 1 | 2 | 1437.86 |
| 2 | 1 | 1269.11 |
| 2 | 2 | 1271.55 |
| 3 | 1 | 1270.46 |
| 3 | 2 | |
| 3 | 3 | 1280.81 |
| 4 | 1 | 1273.54 |
| 4 | 2 | 1284.68 |
| 5 | 1 | 1307.05 |
| 5 | 2 | 1298.11 |
| 5 | 3 | 1306.50 |
The smallest value is bolded
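Selecting the model with the smallest BIC can be sketched as below, using the standard definition BIC = -2 log L + p log n, under which smaller values are preferred. The log-likelihoods and parameter counts in the example are hypothetical placeholders, not values from the paper, and treating the number of hyperedges as the sample size n is an assumption. The function names `bic` and `select_model` are illustrative.

```python
import math


def bic(loglik, n_params, n_obs):
    """Standard BIC: -2 * log-likelihood + (number of parameters) * log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def select_model(fits, n_obs):
    """fits maps (G, K) to (maximised log-likelihood, number of free parameters).
    Returns the (G, K) pair with the smallest BIC and all BIC scores."""
    scores = {key: bic(ll, p, n_obs) for key, (ll, p) in fits.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical fitted models: (G, K) -> (log-likelihood, parameter count).
toy_fits = {
    (1, 1): (-640.0, 8),
    (2, 1): (-600.0, 17),
    (3, 1): (-595.0, 26),
}
best_model, bic_scores = select_model(toy_fits, n_obs=100)
print(best_model, {k: round(v, 2) for k, v in bic_scores.items()})
```

In this toy grid the extra parameters of the largest model are not repaid by its small gain in log-likelihood, so the intermediate model is selected.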
Estimates of , and a from fitting the ELCA model with 3 clusters and 2 additional clusters for the Star Wars data set
| (0.40, 0.40, 0.20) | |
| (0.81, 0.19) | |
| (0.41, 1.00) |
Estimates of from fitting the ELCA model with 3 clusters and 2 additional clusters for the Star Wars data set
| Character | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Wedge | 0.18 | 0.00 | 0.36 |
| Han | 0.00 | 1.00 | 0.00 |
| Luke | 1.00 | 1.00 | 0.00 |
| C-3PO | 0.75 | 0.30 | 0.00 |
| Obi-Wan | 0.00 | 0.00 | 1.00 |
| Leia | 0.12 | 0.48 | 0.07 |
| Biggs | 0.31 | 0.00 | 0.28 |
| Darth Vader | 0.19 | 0.35 | 0.06 |
Fig. 3 Probability of primary clusters for movie scenes in the Star Wars data set plotted against movie scene number for the ELCA model with 3 primary clusters and 2 additional clusters. Cluster 1 is associated with scenes in the first half of the movie, whereas cluster 2 contains scenes mostly in the middle of the movie. On the other hand, scenes occurring in the second half of the movie are slightly more likely to be associated with cluster 3 compared to scenes occurring in the first half of the movie
Fig. 4 Ternary plot of the a posteriori group membership probabilities for the scenes in the Star Wars data set
Fig. 5 Probability of additional clusters for movie scenes in the Star Wars data set plotted against movie scene number for the ELCA model with 3 primary clusters and 2 additional clusters. The majority of movie scenes are in cluster 1, whereas very few scenes are in cluster 2
Estimates of from fitting the LCA model with 3 clusters for the Star Wars data set
| (0.17, 0.61, 0.22) |
Estimates of from fitting the LCA model with 3 clusters for the Star Wars data set
| Character | Cluster 1 | Cluster 2 | Cluster 3 |
|---|---|---|---|
| Wedge | 0.47 | 0.00 | 0.00 |
| Han | 0.00 | 0.40 | 0.00 |
| Luke | 0.23 | 0.74 | 0.00 |
| C-3PO | 0.00 | 0.24 | 0.38 |
| Obi-Wan | 0.00 | 0.00 | 0.60 |
| Leia | 0.00 | 0.21 | 0.04 |
| Biggs | 0.52 | 0.02 | 0.00 |
| Darth Vader | 0.00 | 0.18 | 0.03 |
Contingency table: ELCA with 3 clusters and 2 additional clusters versus LCA with 3 clusters
| LCA | |||
|---|---|---|---|
| ELCA | 1 | 2 | 3 |
| 1 | 16 | 47 | 22 |
| 2 | 0 | 57 | 0 |
| 3 | 12 | 0 | 24 |
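A contingency table of this kind cross-tabulates two hard clusterings of the same scenes. The sketch below assumes each scene is assigned to its most probable cluster under each model (the source does not state the assignment rule) and uses hypothetical labels rather than the Star Wars data. The function name `contingency_table` is illustrative.

```python
import numpy as np


def contingency_table(labels_a, labels_b, n_a, n_b):
    """Entry (r, c) counts items placed in cluster r by the first clustering
    and in cluster c by the second (labels are integers starting at 0)."""
    table = np.zeros((n_a, n_b), dtype=int)
    for a, b in zip(labels_a, labels_b):
        table[a, b] += 1
    return table


# Hypothetical hard assignments for six scenes under two models.
labels_model_a = [0, 0, 1, 2, 2, 1]
labels_model_b = [1, 1, 1, 0, 2, 1]
table = contingency_table(labels_model_a, labels_model_b, n_a=3, n_b=3)
print(table)
```

Rows concentrated in a single column indicate clusters that the two models agree on; rows spread across columns indicate clusters that one model splits differently from the other.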
Fig. 6 Probability of clusters for movie scenes in the Star Wars data set plotted against movie scene number for the LCA model with 3 clusters. Movie scenes in cluster 1 mostly occurred in the second half of the movie, whereas cluster 2 contains the majority of the scenes in the movie. On the other hand, scenes in the first half of the movie are slightly more likely to be associated with cluster 3 compared to scenes in the second half of the movie
Model selection for the Reuters News data set
| No. of clusters | No. of additional clusters | BIC |
|---|---|---|
| 1 | 1 | 18018 |
| 1 | 2 | 19005 |
| 2 | 1 | 17801 |
| 2 | 2 | 17711 |
| 2 | 3 | 17723 |
| 3 | 1 | 17643 |
| 3 | 2 | 17636 |
| 3 | 3 | 17652 |
| 4 | 1 | 17562 |
| 4 | 2 | 17533 |
| 4 | 3 | 17625 |
| 5 | 1 | 17507 |
| 5 | 2 | |
| 5 | 3 | 17611 |
| 6 | 1 | 17468 |
| 6 | 2 | 17489 |
| 7 | 1 | 17514 |
| 7 | 2 | 17526 |
The smallest value is bolded
Estimates of , and a from fitting the ELCA model with 5 clusters and 2 additional clusters for the Reuters News data set
| (0.16, 0.27, 0.19, 0.12, 0.26) | |
| (0.94, 0.06) | |
| (0.28, 1.00) |
Estimates of from fitting the ELCA model with 5 clusters and 2 additional clusters for the Reuters News data set
| Country | Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 |
|---|---|---|---|---|---|
| BRA | 0.19 | 0.27 | 0.00 | 0.42 | 0.00 |
| CAN | 0.00 | 0.27 | 0.00 | ||
| CHN | 0.46 | 0.62 | 0.79 | ||
| DEU | 0.00 | 0.49 | 0.38 | 0.19 | |
| FRA | 0.00 | 0.80 | 0.00 | ||
| GBR | 0.39 | 0.79 | 0.32 | ||
| IND | 0.66 | 0.21 | 0.10 | 0.45 | 0.04 |
| ITA | 0.00 | 0.29 | 0.00 | 0.13 | 0.44 |
| JPN | 0.12 | 0.00 | 0.00 | 0.05 | |
| MEX | 0.00 | 0.01 | 0.04 | 0.00 | |
| RUS | 0.18 | 0.14 | 0.10 | 0.60 | |
| USA | 0.35 | 0.47 | |||
| ZAF | 0.20 | 0.03 | 0.00 | 0.04 | 0.01 |
The largest three values in each column are bolded