Bärbel Maus, Camille Jung, Jestinah M Mahachie John, Jean-Pierre Hugot, Emmanuelle Génin, Kristel Van Steen.
Abstract
Complex human diseases are often heterogeneous in their phenotypic characteristics; for example, Crohn's disease (CD) patients differ with regard to disease location and disease extent. Genetic susceptibility to Crohn's disease is widely acknowledged and has been demonstrated by the identification of over 100 CD-associated genetic loci. However, relating CD subphenotypes to disease susceptibility loci has proven difficult. In this paper we discuss the use of cluster analysis on genetic markers to identify genetic-based subgroups while taking into account possible confounding by population stratification. We show that it is highly relevant to consider the confounding nature of population stratification; otherwise, detected clusters may be strongly related to population groups rather than to disease-specific groups. We therefore explain how principal components (PCs) can be used to correct for population stratification while clustering affected individuals into genetic-based subgroups. The PCs are obtained from 30 ancestry informative markers (AIMs), and the first two PCs are found to discriminate between the continental origins of the affected individuals. Genotypes at 51 CD-associated single nucleotide polymorphisms (SNPs) are used to perform latent class analysis, hierarchical cluster analysis, and Partitioning Around Medoids (PAM) cluster analysis within a sample of affected individuals, with and without the use of principal components to adjust for population stratification. Without correction for population stratification, clusters appear to be influenced by population stratification, whereas with correction, clusters are unrelated to the continental origin of individuals.
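The adjustment step described in the abstract can be illustrated with a minimal sketch. It assumes, since the abstract does not spell out the exact procedure, that "adjusting" means regressing each SNP on the leading principal components of the AIM genotypes and keeping the residuals. The function names, the simulated two-group data, and the allele frequencies below are hypothetical and serve only as an illustration.

```python
import numpy as np


def principal_components(aim_genotypes, n_components=2):
    """Leading PC scores of a (samples x markers) genotype matrix."""
    centered = aim_genotypes - aim_genotypes.mean(axis=0)
    # SVD of the column-centered matrix; u * s gives the PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def adjust_for_pcs(snp_genotypes, pcs):
    """Residualize every SNP column on an intercept plus the PCs (assumed method)."""
    design = np.column_stack([np.ones(len(pcs)), pcs])
    coef, *_ = np.linalg.lstsq(design, snp_genotypes, rcond=None)
    return snp_genotypes - design @ coef


# Hypothetical toy data: two ancestry groups with different allele frequencies.
rng = np.random.default_rng(0)
n = 200
group = rng.integers(0, 2, size=n)
aim_freq = np.where(group[:, None] == 0, 0.1, 0.9)
snp_freq = np.where(group[:, None] == 0, 0.2, 0.6)
aims = rng.binomial(2, np.broadcast_to(aim_freq, (n, 30))).astype(float)
snps = rng.binomial(2, np.broadcast_to(snp_freq, (n, 51))).astype(float)

pcs = principal_components(aims, n_components=2)
adjusted = adjust_for_pcs(snps, pcs)  # input for PAM or hierarchical clustering
```

The residual matrix `adjusted` is uncorrelated with the two PCs by construction, so any distance-based clustering run on it can no longer separate individuals along those ancestry axes. Latent class analysis needs categorical input, which may be why the tables report it for unadjusted data only.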
Year: 2013 PMID: 24147066 PMCID: PMC3798408 DOI: 10.1371/journal.pone.0077720
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Population information in study sample.
Rows: self-reported population. Columns: population based on ancestry informative markers (AIM).
| Self-reported population | East Asia (AIM) | Europe (AIM) | Sub-Saharan Africa (AIM) | Missing (AIM) | Total |
|---|---|---|---|---|---|
| Admixed | 1 | 3 | 1 | 1 | 6 |
| Africa | 0 | 0 | 5 | 1 | 6 |
| Asia | 0 | 3 | 1 | 1 | 5 |
| Europe | 0 | 341 | 1 | 335 | 677 |
| Missing | 2 | 83 | 9 | 57 | 151 |
| Total | 3 | 430 | 17 | 395 | 845 |
P-values for association testing of clusterings with clinical subphenotypes.
| Subphenotype | Unadjusted: latent classes | Unadjusted: PAM clusters | Unadjusted: hierarchical clusters | Adjusted: PAM clusters | Adjusted: hierarchical clusters |
|---|---|---|---|---|---|
| | 0.2966 | 0.6213 | 0.9644 | 0.3799 | 0.3925 |
| Terminal ileum at diagnosis | 0.0094 | 0.2539 | 0.5280 | 0.9124 | 0.8423 |
| Terminal ileum at follow-up | 0.0291 | 0.0691 | 0.1674 | 0.8993 | 1.0000 |
| Colon at diagnosis | 0.9965 | 0.4964 | 0.3616 | 0.5751 | 0.0719 |
| Colon at follow-up | 0.9774 | 0.2269 | 0.0636 | 0.8984 | 0.0670 |
| Behaviour B1 at diagnosis | 0.0840 | 0.8815 | 0.7644 | 0.0569 | 0.6620 |
| Behaviour B1 at follow-up | 0.5800 | 0.1412 | 0.5824 | 0.6860 | 0.7402 |
| | 0.4205 | 0.2270 | 0.2457 | 0.4672 | 0.0538 |
| | 0.2318 | 0.9491 | 0.6560 | 0.8194 | 0.7629 |
| | 0.4659 | 0.2501 | 0.9623 | 0.3739 | 0.9447 |
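P-values of this kind come from testing whether cluster membership is independent of a clinical characteristic. The record does not state which test the authors used, so the following is only a sketch of one possible approach: a permutation test on the Pearson chi-square statistic, written with the standard library. The function names and the toy labels are hypothetical.

```python
import random
from collections import Counter


def chi_square_statistic(labels, traits):
    """Pearson chi-square statistic for the labels x traits contingency table."""
    n = len(labels)
    row_counts = Counter(labels)
    col_counts = Counter(traits)
    cell_counts = Counter(zip(labels, traits))
    stat = 0.0
    for r, n_r in row_counts.items():
        for c, n_c in col_counts.items():
            expected = n_r * n_c / n
            stat += (cell_counts[(r, c)] - expected) ** 2 / expected
    return stat


def permutation_p_value(labels, traits, n_perm=2000, seed=0):
    """Share of trait permutations whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = chi_square_statistic(labels, traits)
    shuffled = list(traits)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if chi_square_statistic(labels, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: cluster label and a binary subphenotype per patient.
clusters = [1] * 10 + [2] * 10
ileal = [1] * 10 + [0] * 10
p_value = permutation_p_value(clusters, ileal)
```

A permutation test avoids the large-sample approximation of the chi-square distribution, which may matter when some clusters contain only a handful of individuals with non-missing clinical data.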
Characteristics of individuals in overall data set and in latent classes.
| | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 | Class 6 | Class 7 | Class 8 | Class 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 407 | 163 | 136 | 33 | 32 | 24 | 18 | 17 | 15 | 845 |
| <17 | 94 (24%) | 53 (33%) | 45 (33%) | 5 (15%) | 0 (0%) | 9 (38%) | 5 (28%) | 8 (47%) | 0 (0%) | 219 (28%) |
| 17-40 | 276 (69%) | 98 (61%) | 79 (59%) | 27 (82%) | 1 (100%) | 14 (58%) | 11 (61%) | 8 (47%) | 1 (100%) | 515 (65%) |
| >=40 | 30 (8%) | 10 (6%) | 11 (8%) | 1 (3%) | 0 (0%) | 1 (4%) | 2 (11%) | 1 (6%) | 0 (0%) | 56 (7%) |
| Terminal ileum at diagnosis | 251 (65%) | 122 (77%) | 99 (75%) | 22 (67%) | 1 (100%) | 22 (92%) | 14 (82%) | 8 (53%) | 1 (100%) | 540 (71%) |
| Terminal ileum at follow-up | 307 (77%) | 139 (86%) | 116 (85%) | 25 (76%) | 1 (100%) | 24 (100%) | 16 (89%) | 12 (71%) | 1 (100%) | 641 (81%) |
| Colon at diagnosis | 286 (74%) | 115 (73%) | 95 (73%) | 25 (78%) | 1 (100%) | 17 (71%) | 13 (76%) | 13 (76%) | 1 (100%) | 566 (74%) |
| Colon at follow-up | 330 (83%) | 129 (80%) | 112 (83%) | 26 (81%) | 1 (100%) | 20 (83%) | 15 (88%) | 14 (82%) | 1 (100%) | 648 (82%) |
| B1 at diagnosis | 313 (80%) | 120 (75%) | 102 (77%) | 26 (79%) | 0 (0%) | 14 (61%) | 12 (67%) | 16 (94%) | 1 (100%) | 604 (78%) |
| B1 at follow-up | 187 (48%) | 67 (42%) | 53 (40%) | 14 (42%) | 0 (0%) | 8 (33%) | 8 (44%) | 7 (41%) | 1 (100%) | 345 (45%) |
| Never | 62 (16%) | 21 (13%) | 19 (15%) | 6 (19%) | 0 (0%) | 4 (17%) | 3 (18%) | 3 (18%) | 0 (0%) | 118 (15%) |
| Intermediate | 286 (74%) | 125 (79%) | 99 (76%) | 23 (72%) | 1 (100%) | 19 (79%) | 13 (76%) | 10 (59%) | 0 (0%) | 576 (75%) |
| Often | 41 (11%) | 13 (8%) | 12 (9%) | 3 (9%) | 0 (0%) | 1 (4%) | 1 (6%) | 4 (24%) | 1 (100%) | 76 (10%) |
| | 297 (74%) | 125 (77%) | 102 (75%) | 23 (70%) | 1 (100%) | 14 (58%) | 9 (50%) | 13 (76%) | 1 (100%) | 585 (74%) |
| | 196 (49%) | 87 (54%) | 69 (51%) | 17 (52%) | 1 (100%) | 17 (71%) | 11 (61%) | 9 (53%) | 0 (0%) | 407 (51%) |
Class 5 contains 31 individuals with missing information on all clinical characteristics. Class 9 contains 14 individuals with missing information on all clinical characteristics. For the other classes, the number of individuals with missing information differs between the clinical characteristics.
Distribution of populations over clusters obtained by latent class analysis.
| Class | Admixed | Africa | Asia | East Asia | Europe | Sub-Saharan Africa | Missing | Total |
|---|---|---|---|---|---|---|---|---|
| Class 1 | 0 (0%) | 0 (0%) | 1 (0.25%) | 1 (0.25%) | 370 (90.91%) | 4 (0.98%) | 31 (7.62%) | 407 (100%) |
| Class 2 | 1 (0.61%) | 0 (0%) | 0 (0%) | 1 (0.61%) | 146 (89.57%) | 2 (1.23%) | 13 (7.98%) | 163 (100%) |
| Class 3 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 134 (98.53%) | 0 (0%) | 2 (1.47%) | 136 (100%) |
| Class 4 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 31 (93.94%) | 0 (0%) | 2 (6.06%) | 33 (100%) |
| Class 5 | 0 (0%) | 0 (0%) | 0 (0%) | 1 (3.13%) | 24 (75%) | 0 (0%) | 7 (21.88%) | 32 (100%) |
| Class 6 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 24 (100%) | 0 (0%) | 0 (0%) | 24 (100%) |
| Class 7 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 18 (100%) | 0 (0%) | 0 (0%) | 18 (100%) |
| Class 8 | 0 (0%) | 1 (5.88%) | 0 (0%) | 0 (0%) | 4 (23.53%) | 11 (64.71%) | 1 (5.88%) | 17 (100%) |
| Class 9 | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 14 (93.33%) | 0 (0%) | 1 (6.67%) | 15 (100%) |
| Total | 1 (0.12%) | 1 (0.12%) | 1 (0.12%) | 3 (0.36%) | 765 (90.53%) | 17 (2.01%) | 57 (6.75%) | 845 (100%) |
Figure 1. Principal components based on 30 ancestry informative markers for 450 individuals.
Figure 2. Percentage change in agglomeration criterion (hierarchical cluster analysis using population adjusted SNPs).
Characteristics of individuals in subset of 450 individuals and in hierarchical clusters based on population adjusted SNPs.
| | Cluster 1 | Cluster 2 | Cluster 3 | Total |
|---|---|---|---|---|
| N | 309 | 102 | 39 | 450 |
| <17 | 90 (29%) | 25 (25%) | 0 (0%) | 115 (28%) |
| 17-40 | 194 (63%) | 67 (66%) | 6 (100%) | 267 (64%) |
| >=40 | 22 (7%) | 10 (10%) | 0 (0%) | 32 (8%) |
| Terminal ileum at diagnosis | 209 (71%) | 68 (69%) | 4 (67%) | 281 (70%) |
| Terminal ileum at follow-up | 249 (81%) | 83 (81%) | 5 (83%) | 337 (81%) |
| Colon at diagnosis | 219 (74%) | 69 (70%) | 2 (33%) | 290 (72%) |
| Colon at follow-up | 252 (83%) | 79 (78%) | 3 (50%) | 334 (81%) |
| B1 at diagnosis | 226 (76%) | 79 (79%) | 4 (67%) | 309 (77%) |
| B1 at follow-up | 124 (42%) | 46 (46%) | 2 (33%) | 172 (43%) |
| Never | 41 (14%) | 11 (11%) | 1 (17%) | 53 (13%) |
| Intermediate | 220 (76%) | 80 (80%) | 2 (33%) | 302 (76%) |
| Often | 30 (10%) | 9 (9%) | 3 (50%) | 42 (11%) |
| | 229 (75%) | 78 (76%) | 4 (67%) | 311 (75%) |
| | 169 (55%) | 55 (54%) | 3 (50%) | 227 (55%) |
Cluster 1 contains three individuals with missing information on all clinical characteristics. Cluster 3 contains 33 individuals with information missing on all clinical characteristics. For cluster 2 each individual provides information for at least one clinical characteristic.
Figure 3. Percentage change in agglomeration criterion (hierarchical cluster analysis using population unadjusted SNP markers).
Adjusted Rand Indexes between latent class analysis (LCA), PAM clustering and hierarchical clustering (HC) (using population unadjusted or adjusted SNP data).
| | Unadjusted LCA | Unadjusted PAM | Unadjusted HC | Adjusted PAM | Adjusted HC |
|---|---|---|---|---|---|
| Unadjusted LCA | | 0.49 | 0.30 | 0.12 | 0.23 |
| Unadjusted PAM | | | 0.23 | 0.20 | 0.13 |
| Unadjusted HC | | | | 0.04 | 0.54 |
| Adjusted PAM | | | | | 0.04 |
| Adjusted HC | | | | | |
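For readers unfamiliar with the Adjusted Rand Index (ARI), the following standard-library sketch computes it from two label vectors using the usual pair-counting formula (Hubert and Arabie). The function name and the toy labels are illustrative only and are not taken from the study data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two clusterings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Pairs placed together in both clusterings, in A only, and in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both clusterings have a single cluster
    return (together_both - expected) / (maximum - expected)


# Hypothetical labels for six individuals under two clustering methods.
pam_labels = [1, 1, 1, 2, 2, 3]
hc_labels = ["a", "a", "b", "b", "b", "c"]
ari = adjusted_rand_index(pam_labels, hc_labels)
```

An ARI of 1 means the two partitions are identical up to relabelling, values near 0 correspond to chance-level agreement, and negative values indicate less agreement than expected by chance. On that scale, the 0.04 between the adjusted PAM and adjusted hierarchical solutions indicates almost no shared structure.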