Leonid Chindelevitch, Po-Ru Loh, Ahmed Enayetallah, Bonnie Berger, Daniel Ziemek.
Abstract
BACKGROUND: Causal graphs are an increasingly popular tool for the analysis of biological datasets. In particular, signed causal graphs (directed graphs whose edges additionally carry a sign denoting upregulation or downregulation) can be used to model regulatory networks within a cell. Such models allow prediction of the downstream effects of regulating biological entities; conversely, they also enable inference of the causative agents behind observed expression changes. However, due to their complex nature, signed causal graph models present special challenges with respect to assessing statistical significance. In this paper we frame and solve two fundamental computational problems that arise in practice when computing appropriate null distributions for hypothesis testing.
Year: 2012 PMID: 22348444 PMCID: PMC3307026 DOI: 10.1186/1471-2105-13-35
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Illustration of the causal graph methodology. Schematic depiction of a set of relationships curated from the literature and transformed into a causal graph, used to explain gene expression data.
Figure 2. Scoring of an example hypothesis. Illustration of scoring for the KLF4+ hypothesis based on the experimental dataset discussed in the main text. Arrows illustrate predicted upregulation or downregulation of all experimentally regulated transcripts one step downstream of KLF4. In this case, all predictions match the experimental observations, resulting in 9 correct and 0 incorrect predictions and a corresponding score of 9.
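The scoring rule described in the Figure 2 caption (correct predictions minus incorrect ones) can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function name `score_hypothesis`, the dictionary representation, and the transcript labels `t1` to `t9` are assumptions made for the example. Signs are encoded as +1 (up), -1 (down), and 0 (no prediction or no observed regulation).

```python
def score_hypothesis(predicted, observed):
    """Return (correct, incorrect, score) for one hypothesis.

    predicted, observed: dicts mapping a transcript name to +1, -1, or 0.
    Transcripts with a 0 on either side contribute nothing to the score.
    """
    correct = incorrect = 0
    for transcript, pred in predicted.items():
        obs = observed.get(transcript, 0)  # unobserved transcripts count as 0
        if pred == 0 or obs == 0:
            continue
        if pred == obs:
            correct += 1
        else:
            incorrect += 1
    return correct, incorrect, correct - incorrect


# Hypothetical stand-in for the KLF4+ case: nine downstream transcripts,
# every predicted sign matching the observed sign.
predicted = {f"t{i}": (+1 if i <= 5 else -1) for i in range(1, 10)}
observed = dict(predicted)
print(score_hypothesis(predicted, observed))  # (9, 0, 9)
```

With this encoding the score equals the dot product of the predicted and observed sign vectors, since matching nonzero signs contribute +1, mismatched signs contribute -1, and zeros contribute nothing.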
Top hypotheses by score and corresponding p-values on an example dataset
| Rank | Hypothesis Name | Correct | Incorrect | Score | Ternary Dot Product p-value | Causal Graph Randomization p-value |
|---|---|---|---|---|---|---|
| 1 | Response to Hypoxia+ | 48 | 9 | 37 | 2 × 10⁻¹² | < 0.001 |
| 2 | Dexamethasone+ | 20 | 4 | 16 | 6 × 10⁻⁶ | < 0.001 |
| 3 | Hydrocortisone+ | 17 | 4 | 13 | 1 × 10⁻⁸ | < 0.001 |
| 4 | PGR+ | 12 | 1 | 11 | 6 × 10⁻⁸ | < 0.001 |
| 5 | SRF+ | 10 | 0 | 10 | 3 × 10⁻⁵ | < 0.001 |
| 6 | KLF4+ | 9 | 0 | 9 | 3 × 10⁻⁶ | < 0.001 |
| 7 | NR3C1+ | 12 | 4 | 8 | 7 × 10⁻⁴ | < 0.001 |
| 7 | Glucocorticoid+ | 12 | 4 | 8 | 8 × 10⁻⁵ | < 0.001 |
| 7 | CCND1+ | 9 | 1 | 8 | 3 × 10⁻⁴ | < 0.001 |
| 7 | Triamcinolone acetonide+ | 8 | 0 | 8 | 9 × 10⁻⁷ | < 0.001 |
| ... | ... | ... | ... | ... | ... | ... |
| 17 | NRF2+ | 9 | 4 | 5 | 0.18 | 0.07 |
Top hypotheses by score in an example experimental dataset of dexamethasone-stimulated chondrocytes (GEO accession GSE7683 [21]). Each hypothesis is scored by the difference between the numbers of correct and incorrect predictions. Significance is assessed by the Ternary Dot Product and Causal Graph Randomization p-values discussed in the text; the latter numbers are estimates based on 1000 runs of graph randomization and for this reason are always a multiple of 0.001. When no randomized graph with a better score for the given hypothesis is detected, we indicate that as "p < 0.001." Note that hypotheses with the same numbers of correct and incorrect predictions do not necessarily have the same p-values because the significance calculation takes into account the full contingency table for each hypothesis; some hypotheses result in more predicted regulations than others.
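The caption's remark that randomization p-values are always multiples of 0.001, and are reported as "p < 0.001" when no randomized graph does better, follows directly from how an empirical p-value is computed. The sketch below illustrates this under stated assumptions: the function names are hypothetical, the randomized scores are synthetic numbers chosen only so the arithmetic is easy to check, and counting ties (`>=`) as hits is an assumption, since the caption speaks only of a "better score".

```python
def empirical_p_value(observed_score, randomized_scores):
    """Fraction of randomized graphs scoring at least as high as observed.

    Returns (hits, runs) rather than a float, so that the zero-hit case
    can be reported as an upper bound instead of as exactly 0.
    """
    runs = len(randomized_scores)
    # Assumption: ties count as hits; the source does not specify this.
    hits = sum(1 for s in randomized_scores if s >= observed_score)
    return hits, runs


def format_p_value(hits, runs):
    """Report 'p < 1/runs' when no randomized graph reached the score."""
    if hits == 0:
        return f"< {1 / runs:g}"
    return f"{hits / runs:g}"


# Synthetic example: 1000 randomized scores, 25 of which reach a score of 4.
randomized_scores = [4] * 25 + [0] * 975
print(format_p_value(*empirical_p_value(4, randomized_scores)))   # 0.025
print(format_p_value(*empirical_p_value(10, randomized_scores)))  # < 0.001
```

Because the estimate is a count divided by the number of runs, its resolution is limited to 1/1000 here; distinguishing smaller p-values would require more randomization runs.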
Contingency table comparing predicted and experimental classifications
Contingency table of predicted and experimental classifications. The columns sum to n+, n-, and n0, the numbers of predicted classifications of each type, and the rows sum to q+, q-, and q0, the numbers of experimental classifications of each type.
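As an illustration of the layout described in this caption, the sketch below tabulates predicted against experimental classifications into a 3 × 3 table whose rows are experimental classes and whose columns are predicted classes. The function name `contingency_table`, the list-based input, and the example data are assumptions made for this sketch and do not come from the paper.

```python
from collections import Counter

SIGNS = (+1, -1, 0)  # order used for both rows and columns: +, -, 0


def contingency_table(predicted, observed):
    """Build a 3 x 3 table: rows = experimental class, columns = predicted class.

    predicted, observed: equal-length sequences of +1, -1, or 0,
    one entry per transcript.
    """
    counts = Counter(zip(observed, predicted))
    return [[counts[(o, p)] for p in SIGNS] for o in SIGNS]


# Made-up data for six transcripts.
predicted = [+1, +1, -1, 0, 0, 0]
observed = [+1, -1, -1, 0, +1, 0]
table = contingency_table(predicted, observed)

column_sums = [sum(row[j] for row in table) for j in range(3)]  # n+, n-, n0
row_sums = [sum(row) for row in table]                          # q+, q-, q0

# Score = correct minus incorrect predictions, read off the upper-left 2 x 2 block.
score = table[0][0] + table[1][1] - table[0][1] - table[1][0]
print(table, column_sums, row_sums, score)
```

Reading the score from the table makes the caption's point concrete: two hypotheses can share the same correct and incorrect counts while differing in the remaining cells and margins, which is why their p-values can differ.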
Figure 3. Pseudocode for Ternary Dot Product algorithms. Pseudocode for algorithms computing the Ternary Dot Product Distribution using thresholding on families of contingency tables.
Run times for Ternary Dot Product Distribution algorithm
| n+ | Quartic algorithm: compute all | Thresholded algorithm |
|---|---|---|
| 8 | 0.05 s | 0.07 s |
| 16 | 0.19 s | 0.15 s |
| 32 | 0.92 s | 0.36 s |
| 64 | 6.16 s | 0.61 s |
| 128 | 53.15 s | 2.35 s |
| 256 | 689.18 s | 5.93 s |
| 512 | 7864.20 s | 19.54 s |
| 1024 | > 1 d | 85.76 s |
Run time comparison of simple quartic Ternary Dot Product Distribution algorithm to thresholded version for an increasing family of problems with (n+, n-, n0, q+, q-) in the ratio (1, 1, 50, 2, 1), a typical usage scenario. Runs were performed on a 3.0 GHz Intel Xeon processor with 2 MB cache.
Figure 4. Computational complexity of Ternary Dot Product algorithms. Counts of the numbers of D-values computed by the simple quartic algorithm and during the thresholding part of the 2 × 2- and 3 × 2-family algorithms. Solid lines indicate total counts, while the corresponding dotted lines indicate the numbers of contingency tables (respectively, families) that pass the ϵ · D_max threshold. The left panel shows a "dense" case n0 = 5n+ while the right panel shows a "sparse" case n0 = 50n+. For these examples we set n+ = n- = q+ = q- and chose ϵ = 10⁻¹⁶.
Figure 5. Two obstacles to randomization of signed directed graphs. A strong quadrilateral and a strong triangle. Solid lines indicate positive edges and dotted lines indicate negative edges.
Figure 6. Flipping a strong triangle using auxiliary edges. The sequence of same-sign edge switches and triangle flips that flips a strong triangle: (1) Opening, (2) Flipping, (3) Closing, and (4) Restoring. Solid lines indicate positive edges and dotted lines indicate negative edges.
Statistics from runs of Causal Graph Randomization algorithm
| Structure | Occurrence rate |
|---|---|
| Strong quadrilateral | 3.76 × 10⁻⁴ |
| Flippable triangle | 1.22 × 10⁻⁶ |
| Strong triangle | 2.44 × 10⁻⁹ |
Rates of occurrence of local graph structures in 79 runs of the randomization algorithm on our test graph. A total of 5.3 billion iterations were performed during these runs.