Michail Papathomas, Sylvia Richardson.
Abstract
This manuscript relates two approaches for exploring complex dependence structures between categorical variables: Bayesian partitioning of the covariate space, incorporating a variable selection procedure that highlights the covariates that drive the clustering, and log-linear modelling with interaction terms. We derive theoretical results on this relation and discuss whether they can be employed to assist log-linear model determination, demonstrating advantages and limitations with simulated and real data sets. The main advantage concerns sparse contingency tables. Inferences from clustering can potentially reduce the number of covariates considered and, subsequently, the number of competing log-linear models, making the exploration of the model space feasible. Variable selection within clustering can inform on marginal independence in general, thus allowing for a more efficient exploration of the log-linear model space. However, we show that the clustering structure is not informative on the existence of interactions in a consistent manner. This work is of interest to those who utilize log-linear models, as well as practitioners such as epidemiologists who use clustering models to reduce the dimensionality of the data and to reveal interesting patterns in how covariates combine.
Keywords: Bayesian model selection; Graphical models; Sparse contingency tables
Year: 2016 PMID: 27330244 PMCID: PMC4896165 DOI: 10.1016/j.jspi.2016.01.002
Source DB: PubMed Journal: J Stat Plan Inference ISSN: 0378-3758 Impact factor: 1.111
Cluster profiles in the hypothetical simple illustration, defined by the multinomial probabilities for each covariate and cluster.
| Cluster 1 | (0.01,0.3,0.69) | (0.01,0.3,0.69) | (0.1,0.1,0.8) | (0.1,0.1,0.8) | (0.8,0.1,0.1) | (0.8,0.1,0.1) |
| Cluster 2 | (0.01,0.5,0.49) | (0.01,0.5,0.49) | (0.8,0.1,0.1) | (0.8,0.1,0.1) | (0.8,0.1,0.1) | (0.8,0.1,0.1) |
| Cluster 3 | (0.29,0.7,0.01) | (0.29,0.7,0.01) | (0.8,0.1,0.1) | (0.8,0.1,0.1) | (0.8,0.1,0.1) | (0.8,0.1,0.1) |
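As a rough illustration of how such profiles can be read, the sketch below checks which covariates have exactly the same multinomial probabilities in every cluster. Such covariates cannot help separate the clusters. The covariate positions 1 to 6 and the helper name `constant_covariates` are assumptions made for this example, since the table carries no column labels; this is not the variable selection procedure used in the paper.

```python
# Multinomial profiles copied from the table above: one entry per cluster,
# one probability vector per covariate (six covariates, three levels each).
# Covariate positions 1..6 are an assumption; the table has no column labels.
profiles = {
    "Cluster 1": [(0.01, 0.3, 0.69), (0.01, 0.3, 0.69), (0.1, 0.1, 0.8),
                  (0.1, 0.1, 0.8), (0.8, 0.1, 0.1), (0.8, 0.1, 0.1)],
    "Cluster 2": [(0.01, 0.5, 0.49), (0.01, 0.5, 0.49), (0.8, 0.1, 0.1),
                  (0.8, 0.1, 0.1), (0.8, 0.1, 0.1), (0.8, 0.1, 0.1)],
    "Cluster 3": [(0.29, 0.7, 0.01), (0.29, 0.7, 0.01), (0.8, 0.1, 0.1),
                  (0.8, 0.1, 0.1), (0.8, 0.1, 0.1), (0.8, 0.1, 0.1)],
}


def constant_covariates(profiles):
    """Return 1-based positions of covariates whose profile is identical in every cluster."""
    rows = list(profiles.values())
    n_covariates = len(rows[0])
    return [
        j + 1
        for j in range(n_covariates)
        if all(row[j] == rows[0][j] for row in rows)
    ]


constant = constant_covariates(profiles)
print(constant)  # [5, 6]
```

Only the last two covariates share one profile across all three clusters; the first four differ in at least one cluster.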
Summary cluster profiles in the hypothetical simple illustration. The ‘<’ (‘>’) symbol denotes that an observation of a given covariate in a given cluster is more (less) likely compared to the average in the whole sample; otherwise, the ‘0’ symbol is used.
| | 0.8 | 0.8 | 0.9 | 0.9 | 0.001 | 0.001 |
| Cluster 1 | 000 | 000 | ||||
| Cluster 2 | 000 | 000 | ||||
| Cluster 3 | 000 | 000 |
Simulation specifications.
| | Number of subjects | Number of covariates | Number of levels of covariates | Number of cells in contingency table | Approximate number of models | Number of covariates that form interactions |
|---|---|---|---|---|---|---|
| Simulation 1 | 10,000 | 10 | 2 | 1024 | 3.5184 × 10^13 | 7 |
| Simulation 2 | 10,000 | 10 | 2 | 1024 | 3.5184 × 10^13 | 6 |
| Simulation 3 | 10,000 | 10 | 2 | 1024 | 3.5184 × 10^13 | 9 |
| Simulation 4 | 5,000 | 20 | 3 | 3.4 × 10^9 | 1.5 × 10^57 | 6 |
| Simulation 5 | 10,000 | 100 | 2 | 1.27 × 10^30 | 2^4950 | 8 |
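The sizes in this table can be reproduced with a few lines of arithmetic. The sketch below assumes that the number of cells is the number of levels raised to the number of covariates, and that the approximate number of models counts one model per undirected graph on the covariates, that is 2 raised to the number of covariate pairs. The second assumption is an inference from the reported figures, not a statement taken from the source. The function names are hypothetical.

```python
from math import comb, log10


def table_cells(p, levels):
    """Cells in a contingency table of p covariates with `levels` levels each."""
    return levels ** p


def graph_count(p):
    """Assumption: one model per undirected graph on p nodes, i.e. 2**C(p, 2)."""
    return 2 ** comb(p, 2)


def power_of_two_sci(exponent):
    """Format 2**exponent as 'm x 10^e' via logarithms, avoiding float overflow."""
    log_value = exponent * log10(2)
    e = int(log_value)
    m = 10 ** (log_value - e)
    return f"{m:.2f} x 10^{e}"


# (label, number of covariates, levels per covariate) taken from the table above.
simulations = [
    ("Simulations 1-3", 10, 2),
    ("Simulation 4", 20, 3),
    ("Simulation 5", 100, 2),
]

for label, p, levels in simulations:
    pairs = comb(p, 2)
    print(f"{label}: cells = {table_cells(p, levels)}, "
          f"models = 2^{pairs} ~ {power_of_two_sci(pairs)}")
```

With 10,000 subjects spread over as many as 2^100 cells, almost every cell is empty, which is the sparse-table setting the abstract refers to.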
Fig. 1. The graphical models used for the five simulations.
MCMC specifications for the clustering analyses and for the reversible jump chains used for log-linear model comparison. Clustering analyses were performed using the R package PReMiuM. Reversible jump analyses were performed using Matlab code. All analyses were performed on a PC equipped with an Intel(R) Core(TM) i7-2600K CPU at 3.40 GHz with 8 GB RAM.
| Clustering algorithms | | | | |
| | Burn-in | Iterations after burn-in | Run time in minutes (approx.) | Comment |
| Simulation 1 | 40,000 | 20,000 | 24 | |
| Simulation 2 | 40,000 | 20,000 | 24 | |
| Simulation 3 | 40,000 | 20,000 | 24 | |
| Simulation 4 | 100,000 | 20,000 | 30 | |
| Simulation 5 | 100,000 | 20,000 | 90 | |
| Edwards and Havranek data (CHD) | 40,000 | 20,000 | 3 | |
| Genetic-environmental data | 40,000 | 20,000 | 10 | |
| Reversible jump chains | | | | |
| | Burn-in | Iterations | Run time in minutes | Comment |
| Simulation 1 | 10,000 | 100,000 | 420 | |
| Simulation 2 | 10,000 | 100,000 | 420 | |
| Simulation 3 | 10,000 | 100,000 | 420 | |
| Simulation 4 | 2,000 | 10,000 | 360 | after discarding 14 covariates |
| Simulation 5 | 50,000 | 10^6 | 240 | after discarding 92 covariates |
| Edwards and Havranek data (CHD) | 20,000 | 10^6 | 65 | |
| Genetic-environmental data | 20,000 | 10^6 | 65 | after discarding 18 SNPs |
Cluster profiles for the five simulations. The number of subjects typically allocated to each representative cluster is given in parentheses. All posterior median selection probabilities for the remaining 14 covariates in Simulation 4 were less than 0.14. Posterior median selection probabilities for the remaining 92 covariates in Simulation 5 were either equal to zero or smaller than 0.01.
| Simulation 1 | ||||||||||
| | A | B | C | D | E | F | G | H | I | J |
| | 0.36 | 0.78 | 0.32 | 0.75 | 0.06 | 0.05 | 0.00 | 0.48 | 0.57 | 0.50 |
| Cluster 1 (5465) | 00 | 00 | 00 | 00 | ||||||
| Cluster 2 (3159) | 00 | 00 | 00 | 00 | ||||||
| Cluster 3 (1376) | 00 | 00 | 00 | 00 | 00 | 00 | ||||
| Simulation 2 | ||||||||||
| | A | B | C | D | E | F | G | H | I | J |
| | 0.63 | 0.38 | 0.35 | 0.53 | 0.00 | 0.50 | 0.51 | 0.16 | 0.07 | 0.09 |
| Cluster 1 (1153) | 00 | 00 | 00 | 00 | 00 | 00 | 00 | 00 | ||
| Cluster 2 (1926) | 00 | 00 | 00 | 00 | 00 | |||||
| Cluster 3 (2031) | 00 | 00 | 00 | 00 | ||||||
| Cluster 4 (2466) | 00 | 00 | 00 | 00 | ||||||
| Cluster 5 (2424) | 00 | 00 | 00 | 00 | 00 | |||||
| Simulation 3 | ||||||||||
| | A | B | C | D | E | F | G | H | I | J |
| | 0.38 | 0.50 | 0.30 | 0.54 | 0.07 | 0.34 | 0.49 | 0.41 | 0.43 | 0.66 |
| Cluster 1 (7676) | 00 | 00 | 00 | 00 | 00 | |||||
| Cluster 2 (2324) | 00 | 00 | 00 | 00 | 00 | |||||
| Simulation 4 | ||||||||||
| | A | B | C | D | E | F | ||||
| | 0.92 | 0.87 | 0.97 | 0.56 | 0.70 | 0.46 | ||||
| Cluster 1 (2986) | ||||||||||
| Cluster 2 (306) | 000 | 000 | ||||||||
| Cluster 3 (700) | ||||||||||
| Cluster 4 (260) | ||||||||||
| Cluster 5 (354) | 000 | 000 | ||||||||
| Cluster 6 (394) | 000 | |||||||||
| Simulation 5 | ||||||||||
| | A | B | C | D | E | F | G | H | ||
| | 0.96 | 0.95 | 0.97 | 0.93 | 0.97 | 0.96 | 0.97 | 0.96 | ||
| Cluster 1 (4036) | ||||||||||
| Cluster 2 (3813) | ||||||||||
| Cluster 3 (399) | 00 | |||||||||
| Cluster 4 (720) | ||||||||||
| Cluster 5 (902) | ||||||||||
| Cluster 6 (130) | ||||||||||
| Edwards and Havranek data (CHD) | ||||||||||
| | A | B | C | D | E | F | ||||
| | 0.86 | 0.92 | 0.94 | 0.26 | 0.81 | 0.10 | ||||
| Cluster 1 (900) | 00 | 00 | ||||||||
| Cluster 2 (941) | 00 | 00 | ||||||||
| Genetic-environmental data (GE) | ||||||||||
| | rs8034191 (A) | rs4324798 (B) | rs1950081 (C) | age (D) | sex (E) | smoking (F) | ||||
| | 0.01 | 0.00 | 0.10 | 0.92 | 0.82 | 0.85 | ||||
| Cluster 1 (2222) | 00 | 00 | 00 | |||||||
| Cluster 2 (2059) | 00 | 00 | 00 | |||||||
Fig. 2. The resulting best models from the five simulations.
Mixing performance of samplers. Median of iterations to best model is calculated after 30 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011a). See Fig. 2 for the highest posterior probability model.
| | Acceptance rate as a percentage | Iterations (median) to highest posterior probability model | Posterior probability for highest probability model |
|---|---|---|---|
| Simulation 1 | |||
| (a) Uniformly random (PDV) | 5.1 | 590 (452,821) | 0.55 |
| (b) Cluster specific | 3.8 | 247 (164,369) | 0.55 |
| (c) Combined (30%,10%) | 5.3 | 540 (290,674) | 0.53 |
| (d) Combined (20%,20%) | 4.9 | 403 (312,493) | 0.55 |
| Simulation 2 | |||
| (a) Uniformly random (PDV) | 4.4 | 717 (475,990) | 0.60 |
| (b) Cluster specific | 4.4 | 189 (147,238) | 0.58 |
| (c) Combined (30%,10%) | 4.4 | 417 (346,354) | 0.60 |
| (d) Combined (20%,20%) | 4.5 | 257 (181,314) | 0.59 |
| Simulation 3 | |||
| (a) Uniformly random (PDV) | 3.2 | 657 (545,1065) | 0.62 |
| (b) Cluster specific | 3.1 | 445 (335,592) | 0.60 |
| (c) Combined (30%,10%) | 3.3 | 538 (431,701) | 0.60 |
| (d) Combined (20%,20%) | 3.2 | 560 (368,815) | 0.61 |
| Simulation 4 (considering only the 6 important covariates) | |||
| (a) Uniformly random | 2.2 | 661 (550,746) | 0.55 |
| (b) Cluster specific | 2.08 | 685 (534,1015) | 0.49 |
| (c) Combined (30%,10%) | 2.5 | 625 (543,806) | 0.42 |
| (d) Combined (20%,20%) | 2.2 | 733 (551,947) | 0.62 |
| Simulation 5 (considering only the 8 important covariates) | |||
| Any of the 4 equivalent strategies | 1.1 | 5183 (3711,6590) | 0.74 |
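To make the comparison between search strategies easier to read, the sketch below computes the ratio of median iterations for the uniformly random strategy to those for the cluster-specific strategy, using the medians reported in the table above. A ratio above 1 means the cluster-specific strategy reached the highest posterior probability model in fewer iterations. This is only an arithmetic summary of the medians that ignores the reported quartiles; it is not an analysis taken from the source, and the names `median_iterations` and `speedup` are hypothetical.

```python
# Median iterations to the highest posterior probability model,
# copied from the table above for strategies (a) and (b).
median_iterations = {
    "Simulation 1": {"uniform": 590, "cluster": 247},
    "Simulation 2": {"uniform": 717, "cluster": 189},
    "Simulation 3": {"uniform": 657, "cluster": 445},
    "Simulation 4": {"uniform": 661, "cluster": 685},
}


def speedup(row):
    """Ratio of uniform-random to cluster-specific median iterations."""
    return row["uniform"] / row["cluster"]


ratios = {name: speedup(row) for name, row in median_iterations.items()}

for name, ratio in ratios.items():
    direction = "fewer" if ratio > 1 else "more"
    print(f"{name}: ratio = {ratio:.2f} "
          f"(cluster-specific needs {direction} iterations)")
```

On these medians the ratio exceeds 1 in Simulations 1 to 3 and falls slightly below 1 in Simulation 4, so the gain from the cluster-specific strategy is not uniform across settings.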
Mixing performance of samplers. Median of iterations to best model is calculated after 300 runs of the reversible jump MCMC chain. First and third quartiles are given in parentheses. PDV denotes the unrefined model search strategy adopted in Papathomas et al. (2011a).
| Edwards and Havranek data (CHD) | |||
| | Acceptance rate as a percentage | Iterations (median) to highest posterior probability model | Posterior probability for highest probability model ‘ADE+AC+BC+BE+F’ |
| (a) Uniformly random (PDV) | 5.2 | 314 (215,582) | 0.28 |
| (b) Cluster specific | 3.7 | 244 (162,378) | 0.28 |
| (c) Combined (30%,10%) | 4.9 | 273 (172,470) | 0.27 |
| (d) Combined (20%,20%) | 4.6 | 248 (155,392) | 0.28 |
| Genetic-environmental data [including important (characterized as such by clustering) representative SNPs] | |||
| | Acceptance rate as a percentage | Iterations (median) to highest posterior probability model | Posterior probability for highest probability model ‘A+B+C+DEF’ |
| (a) Uniformly random | 6.3 | 564 (257,1205) | 0.53 |
| (b) Cluster specific | 8.4 | 196 (83,443) | 0.51 |
| (c) Combined (30%,10%) | 6.9 | 310 (147,670) | 0.51 |
| (d) Combined (20%,20%) | 7.5 | 235 (91,516) | 0.52 |