Richard A Mushlin, Stephen Gallagher, Aaron Kershenbaum, Timothy R Rebbeck.
Abstract
BACKGROUND: Commonly-occurring disease etiology may involve complex combinations of genes and exposures resulting in etiologic heterogeneity. We present a computational algorithm that employs clique-finding for heterogeneity and multidimensionality in biomedical and epidemiological research (the "CHAMBER" algorithm). METHODOLOGY/PRINCIPAL
Year: 2009 PMID: 19287484 PMCID: PMC2653643 DOI: 10.1371/journal.pone.0004862
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of the Bi-clique-Finding Algorithm.
Step 1 constructs a bipartite graph to identify all relationships between nodes (Figure 1, Phase I). In Step 2, the algorithm forms maximal bi-cliques by exhaustively searching the space of all genotype combinations to identify an initial set of maximal bi-cliques (Figure 1, Phase II). In Step 3, a figure of merit (FOM) is generated to prioritize “interesting” bi-cliques (Figure 1, Phase II). The FOM can be any measure inherent to the data. Here, we consider values of features (e.g., genotypes) in a 2×2 contingency table of affected cases and unaffected controls by exposure (e.g., genotype). In Step 4, a “lattice” is built by connecting each pair of bi-cliques to their least upper bound and their greatest lower bound using set union and intersection (Figure 1, Phase III). In Step 5, the bi-cliques of greatest interest are identified using a parsimony principle: “optimal” bi-cliques should contain the most parsimonious set of features, such that adding more features does not substantially improve the FOM. To achieve this, we employ the set covering approach [33] (Appendix S1).
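The bi-clique formation in Step 2 can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `maximal_bicliques`, the toy subjects, and the brute-force closure strategy are all assumptions made for illustration. Each subject is linked to the features it carries, and a bi-clique is maximal when neither its subject side nor its feature side can be extended.

```python
from itertools import combinations


def maximal_bicliques(subject_features):
    """Map each closed feature set to the subjects that carry all of it.

    subject_features: dict of subject id -> set of feature labels.
    Brute force over feature combinations, so only suitable for toy inputs.
    """
    all_features = sorted(set().union(*subject_features.values()))
    found = {}
    for size in range(1, len(all_features) + 1):
        for combo in combinations(all_features, size):
            wanted = set(combo)
            members = frozenset(
                s for s, feats in subject_features.items() if wanted <= feats
            )
            if not members:
                continue  # no subject carries this combination
            # Closure: every feature shared by all members. Extending the
            # feature side this way is what makes the bi-clique maximal.
            shared = frozenset.intersection(
                *(frozenset(subject_features[s]) for s in members)
            )
            found[shared] = members
    return found


# Hypothetical toy data: four subjects and three genotype features.
toy = {
    "s1": {"G01", "G03", "G05"},
    "s2": {"G03", "G05"},
    "s3": {"G01", "G03"},
    "s4": {"G01"},
}
bicliques = maximal_bicliques(toy)
for feats, members in sorted(bicliques.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(feats), sorted(members))
```

In this toy input, G05 never occurs without G03, so the feature set {G05} alone is not maximal and is absorbed into {G03, G05}. The search is exponential in the number of features, which is why the sketch is limited to small examples.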
Figure 2. Distribution of P-values and Odds Ratios for Four Simulated Datasets.
Designated patterns in D2–D4 are shown as large filled glyphs. Dataset D1 was modeled to have no factors that confer risk of being a case vs. a control. Datasets D2 and D3 contain a 2-gene and a 4-gene risk pattern, respectively. Dataset D4 simulated etiologic heterogeneity, in which disease risk was conferred by different patterns in different subsamples. The list of all discovered patterns was filtered to include only those with support > 5% of cases, odds ratio > 1, and P-value < 0.05. P-value was used as the FOM. Note that adding even a single high-risk genotype (D2, D3) results in many patterns above the noise level (D1).
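The filter described in this caption can be sketched as follows. Several details are assumptions: support is read as the share of cases that carry the pattern, the P-value is a one-sided Fisher's exact test (the caption does not say which tail was used), and the names `fisher_one_sided` and `passes_filter` are hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]].

    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls.
    """
    exposed, cases, n = a + b, a + c, a + b + c + d
    upper = min(exposed, cases)
    total = comb(n, exposed)
    tail = sum(
        comb(cases, k) * comb(n - cases, exposed - k) for k in range(a, upper + 1)
    )
    return tail / total


def passes_filter(a, b, c, d, min_support=0.05, alpha=0.05):
    """Apply the three thresholds: support, odds ratio, and P-value.

    Assumes the table contains at least one case (a + c > 0).
    """
    support = a / (a + c)  # assumed reading of "support > 5% of cases"
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return (
        support > min_support
        and odds_ratio > 1
        and fisher_one_sided(a, b, c, d) < alpha
    )


print(passes_filter(11, 4, 146, 365))  # counts with a strong association
print(passes_filter(10, 10, 10, 10))   # odds ratio of exactly 1
```

A two-sided test would give a larger P-value for the same table, so this sketch is slightly more permissive than a two-sided filter would be.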
Figure 3. Dataset D2 partitioned by the 2 genes in the designated pattern for cases (inner band) and controls (outer band).
The solid white sector represents the single feature G03 without G05, and the checkered sector represents G03 with G05. Together, the checkered and white sectors represent everyone with G03. Generalizing the description of the risky pattern from G03 and G05 to G03 alone identifies all the people with the high-risk 2-gene pattern while picking up only a small fraction of low-risk false positives. Frequencies are rounded to 1%, and the “∼” symbol represents logical “not”.
Table 1. Relationships among top-ranking Bi-cliques from Simulated Dataset D2.
| Rank | G01 = 0.783 | G03 = 0.0782 | G05 = 0.8388 | G08 = 0.8821 | Fisher's Exact Test P-value |
| 1 | T | T | 5.59×10−7 | ||
| 2 | S | S | S | 7.55×10−7 | |
| 3 | T | T | T | 1.07×10−6 | |
| 4 | S | S | S | S | 1.72×10−6 |
| 5 | T | T | 2.11×10−6 | ||
| 6 | T | 2.59×10−6 | |||
| 7 | S | S | S | 2.62×10−6 | |
| 8 | R | R | 3.16×10−6 |
Genes are labeled with their frequencies used for simulating the dataset. The designated high risk pattern, marked R, is ranked 8th. Some specializations of R, marked S, are also high risk. Thus, bi-cliques ranked 2, 4, and 7 are specific instances of bi-clique 8, and include 78%, 69%, and 88%, respectively, of the same individuals as bi-clique 8. All confer an approximately two-fold enhanced risk of disease. These patterns all contain the rare allele (7.8%) for G03, plus common alleles of G01, G05, and G08. Thus, the chance of having the designated genotype pattern if the individual has G03 = 0.0782 is 84%, regardless of the genotypes at the other loci. Stated differently, 84% of the individuals in bi-cliques 1, 3, 5, and 6 have the simulated combination of risk-conferring alleles. G03 is the single gene selected by our set covering algorithm to be the most parsimonious description of all the significant risky patterns. Note that patterns containing G03 but not G05, marked T, involve very common genes combined with G03. This makes the population at risk from these patterns a large subset of the population described by G03 alone. Similar effects are seen in datasets D3 and D4.
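The percentages quoted above can be approximated from the simulation frequencies in the column headers. The sketch below assumes the loci were simulated independently, which the caption does not state explicitly, and the helper name `retained_fraction` is hypothetical.

```python
# Simulation frequencies taken from the column headers of the table above.
freq = {"G01": 0.783, "G03": 0.0782, "G05": 0.8388, "G08": 0.8821}


def retained_fraction(extra_genes):
    """Expected share of a pattern's carriers who also carry every gene in
    extra_genes, assuming the loci are independent."""
    share = 1.0
    for gene in extra_genes:
        share *= freq[gene]
    return share


# Adding genes to the designated pattern (G03 with G05):
print(round(100 * retained_fraction(["G01"])))         # 78
print(round(100 * retained_fraction(["G01", "G08"])))  # 69
print(round(100 * retained_fraction(["G08"])))         # 88
# Share of G03 carriers who also carry G05:
print(round(100 * retained_fraction(["G05"])))         # 84
```

Under that independence assumption, the results agree with the 78%, 69%, and 88% overlaps quoted for bi-cliques 2, 4, and 7, and with the 84% chance of carrying the designated pattern given G03.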
Table 2. Summary of Results of Set Covering Algorithm for Simulated Datasets.
| Dataset | Designated Risk Pattern | Covering Pattern | Coverage | OR | P |
| D2 | | | 30/33 (91%) | 2.33 | 2.59E-06 |
| | | None | 3/33 (9%) | | |
| D3 | | | 9/96 (9%) | 1.98 | 1.06E-04 |
| | | | 56/96 (58%) | 1.39 | 3.98E-03 |
| | | | 8/96 (8%) | 1.78 | 8.42E-03 |
| | | | 14/96 (15%) | 1.38 | 1.37E-02 |
| | | G08 = 0.8821 | 7/96 (7%) | 1.56 | 3.84E-02 |
| | | None | 2/96 (2%) | | |
| D4 | | | 9/38 (24%) | 1.76 | 2.37E-03 |
| | | | 24/38 (63%) | 1.44 | 1.29E-02 |
| | | None | 5/38 (13%) | | |
The set covering algorithm was run on the bi-cliques found in the three simulated datasets. The fraction of input patterns covered by each covering pattern is shown. In dataset D2, 30 of the 33 input patterns could be covered by the single pattern G03 = 0.0782. This is consistent with the data in Table 1, where the common thread of G03 was seen in all eight top patterns. The number of interesting patterns in D2 has thus been reduced from 30 to 1. Dataset D3 has a more complex risk (four genes), and five patterns were needed to cover 94 of the 96 bi-cliques found in D3. Note that the first cover (three genes, P≈0.0001) could itself be covered by the second cover (one gene, P≈0.0040) or the fourth cover (two genes, P≈0.0137). However, the cost model (Appendix S1, Step 5) determined that the difference in P values between these was too large to generalize the three-gene cover pattern to a more parsimonious, but less significant, one- or two-gene cover pattern. Dataset D4, with risk from both the D2 and D3 patterns in the same population, is covered by two simpler patterns. Note that the first D4 cover is the same as the D2 cover. The other D4 cover is a simpler version of the top D3 cover. This slight difference is not unexpected since, for reasons discussed in the text and Appendix S3, the odds ratios and P values differ between the heterogeneous population D4 and the homogeneous populations D2 and D3.
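The idea of covering many bi-cliques with a few simpler patterns can be illustrated with a generic greedy set-cover heuristic. This is a stand-in, not the cost model of Appendix S1: it ignores the FOM entirely, it treats a candidate as covering a pattern whenever all of the candidate's features appear in that pattern, and the name `greedy_cover` and the toy patterns are hypothetical.

```python
def greedy_cover(patterns, candidates):
    """Pick candidates that cover the most still-uncovered patterns.

    patterns, candidates: iterables of frozensets of feature labels.
    Returns (chosen, uncovered): chosen is a list of (candidate, count)
    pairs, uncovered is the set of patterns no candidate could cover.
    """
    candidates = list(candidates)
    uncovered = set(patterns)
    chosen = []
    while uncovered and candidates:
        # Candidate that generalizes the largest number of uncovered patterns.
        best = max(candidates, key=lambda c: sum(1 for p in uncovered if c <= p))
        newly = {p for p in uncovered if best <= p}
        if not newly:
            break  # remaining patterns correspond to a "None" row
        chosen.append((best, len(newly)))
        uncovered -= newly
    return chosen, uncovered


# Hypothetical input patterns and candidate covers.
patterns = [
    frozenset({"G03"}),
    frozenset({"G01", "G03"}),
    frozenset({"G03", "G05"}),
    frozenset({"G01", "G03", "G05"}),
    frozenset({"G08"}),
]
candidates = [
    frozenset({"G03"}),
    frozenset({"G01", "G03"}),
    frozenset({"G03", "G05"}),
]
chosen, uncovered = greedy_cover(patterns, candidates)
print(chosen)
print(uncovered)
```

In this toy input the single-gene candidate covers four of the five patterns and one pattern stays uncovered. A cost model like the one the caption describes would additionally reject a generalization that loses too much significance.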
Table 3. Results of the CHAMBER Algorithm for the Detection of High-Dimensional Combinations: Estrogen Metabolism Genes in a Population-Based Case-Control Study of Breast and Endometrial Cancer.
| Group | Exposed Cases | Exposed Controls | Unexp. Cases | Unexp. Controls | N | OR | P-value |
| | 11 | 4 | 146 | 365 | 526 | 6.88 | 0.0005 |
| | 49 | 71 | 106 | 294 | 520 | 1.91 | 0.0022 |
| | 41 | 59 | 118 | 312 | 530 | 1.84 | 0.0062 |
| | 57 | 95 | 112 | 292 | 556 | 1.56 | 0.0173 |
| | 15 | 17 | 128 | 333 | 493 | 2.30 | 0.0206 |
| | 58 | 108 | 115 | 308 | 589 | 1.44 | 0.0403 |
| | 28 | 46 | 131 | 349 | 554 | 1.62 | 0.0441 |
| | 19 | 29 | 108 | 296 | 452 | 1.80 | 0.0471 |
| | 53 | 39 | 344 | 482 | 918 | 1.90 | 0.0025 |
| | 51 | 38 | 530 | 740 | 1359 | 1.87 | 0.0030 |
| | 78 | 73 | 399 | 589 | 1139 | 1.58 | 0.0060 |
| | 41 | 34 | 463 | 662 | 1200 | 1.72 | 0.0153 |
| | 99 | 105 | 378 | 557 | 1139 | 1.39 | 0.0207 |
| | 115 | 122 | 313 | 435 | 985 | 1.31 | 0.0419 |
| | 13 | 56 | 22 | 221 | 312 | 2.33 | 0.0237 |
| | 43 | 58 | 388 | 960 | 1449 | 1.83 | 0.0031 |
| | 394 | 918 | 21 | 92 | 1425 | 1.88 | 0.0055 |
| | 113 | 210 | 269 | 681 | 1273 | 1.36 | 0.0149 |
| | 35 | 45 | 285 | 621 | 986 | 1.69 | 0.0182 |
| | 43 | 67 | 297 | 687 | 1094 | 1.48 | 0.0371 |
AA = African American.
EA = European American.
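As a reading aid for the table above, the OR and N columns can be reproduced from the four counts in a row. The symbols a, b, c, and d (exposed cases, exposed controls, unexposed cases, unexposed controls) are labels introduced here, and the cross-product odds ratio is assumed rather than stated in the table. For the first row (11, 4, 146, 365):

```latex
\mathrm{OR} = \frac{a\,d}{b\,c} = \frac{11 \times 365}{4 \times 146} = \frac{4015}{584} \approx 6.88,
\qquad
N = a + b + c + d = 11 + 4 + 146 + 365 = 526
```

Both values match the entries reported for that row.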
Figure 4. The designated pattern pair in dataset D4 is the highest scoring of all pairs.
One of the components of the designated pattern (filled blue) could not be identified among the individual patterns in dataset D4 (green dots). The same two components (unfilled blue) scored much higher in single risk datasets D2 and D3.
Figure 5. Motif suggested by pattern pair “BC” for a 3-gene pattern (“B”) and a 2-gene pattern (“C”) sharing 1 gene in a serial/parallel motif.