Alexander Statnikov, Mikael Henaff, Nikita I Lytkin, Constantin F Aliferis.
Abstract
BACKGROUND: The discovery of molecular pathways is a challenging problem and its solution relies on the identification of causal molecular interactions in genomics data. Causal molecular interactions can be discovered using randomized experiments; however, such experiments are often costly, infeasible, or unethical. Fortunately, algorithms that infer causal interactions from observational data have been in development for decades, predominantly in the quantitative sciences, and many of them have recently been applied to genomics data. While these algorithms can infer unoriented causal interactions between involved molecular variables (i.e., without specifying which one is the cause and which one is the effect), causally orienting all inferred molecular interactions was assumed to be an unsolvable problem until recently. In this work, we use transcription factor-target gene regulatory interactions in three different organisms to evaluate a new family of methods that, given observational data for just two causally related variables, can determine which one is the cause and which one is the effect.
Year: 2012 PMID: 23282373 PMCID: PMC3535696 DOI: 10.1186/1471-2164-13-S8-S22
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
High-level description of the tested causal orientation methods.
| Method | Reference | Key principles | Sufficient assumptions for causally orienting X → Y | Sound |
|---|---|---|---|---|
| ANM | [ ] | Assuming X → Y with Y = f(X) + e1, where X and e1 are independent, there will be no such additive noise model in the opposite direction X ← Y, X = g(Y) + e2, with Y and e2 independent. | • Y = f(X) + e1; | Yes |
| PNL | [ ] | Assuming X → Y with Y = f2(f1(X) + e1), there will be no such model in the opposite direction X ← Y, X = g2(g1(Y) + e2), with Y and e2 independent. | • Y = f2(f1(X) + e1); | Yes |
| IGCI | [ ] | Assuming X → Y with Y = f(X), one can show that the KL-divergence (a measure of the difference between two probability distributions) between P(Y) and a reference distribution (e.g., Gaussian or uniform) is greater than the KL-divergence between P(X) and the same reference distribution. | • Y = f(X) (i.e., there is no noise in the model); | Yes |
| GPI-MML | [ ] | Assuming X → Y, the least complex description of P(X, Y) is given by separate descriptions of P(X) and P(Y\|X). By estimating the latter two quantities using methods that favor functions and distributions of low complexity, the likelihood of the observed data given X → Y is inversely related to the complexity of P(X) and P(Y\|X). | • Y = f(X, e); | No |
| ANM-MML | [ ] | Same as for GPI-MML, except for a different way of estimating P(Y\|X) and P(X\|Y). | • Y = f(X) + e; | No |
| GPI | [ ] | Assuming X → Y with Y = f(X, e1), where X and e1 are independent and f is "sufficiently simple", there will be no such model in the opposite direction X ← Y, X = g(Y, e2), with Y and e2 independent and g "sufficiently simple". | Same as for GPI-MML. | No |
| ANM-GAUSS | [ ] | Same as for ANM-MML, except for a different way of estimating P(X) and P(Y). | Same as for ANM-MML. | No |
| LINGAM | [ ] | Assuming X → Y, if we fit linear models Y = b2X + e1 and X = b1Y + e2, with e1 and e2 independent, then we will have b1 < b2. | • Y = b2X + e1; | Yes |
The last column indicates whether a method is sound, i.e. it can provably orient a causal structure under its sufficient assumptions. Because causal orientation methodologies are fairly new and not completely characterized, it is possible that proofs of correctness will become available for GPI-MML, ANM-MML, GPI, and ANM-GAUSS. All methods implicitly assume that there are no feedback loops. The noise term in the models is denoted by small "e".
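To make the additive-noise-model (ANM) idea concrete, here is a minimal sketch, not the authors' implementation: published ANM methods typically use Gaussian-process regression and the HSIC independence test, whereas this toy stands in a cubic polynomial fit and a small biased HSIC estimate. The function names, kernel bandwidth, and simulated data are all illustrative assumptions.

```python
import numpy as np

def hsic(a, b):
    """Biased HSIC estimate with Gaussian kernels (bandwidth 1 after scaling).
    Larger values indicate stronger statistical dependence between a and b."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    n = len(a)
    K = np.exp(-0.5 * (a[:, None] - a[None, :]) ** 2)
    L = np.exp(-0.5 * (b[:, None] - b[None, :]) ** 2)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def anm_dependence(cause, effect, deg=3):
    """Fit effect = f(cause) + e with a polynomial f, then measure how
    dependent the residuals are on the putative cause."""
    resid = effect - np.polyval(np.polyfit(cause, effect, deg), cause)
    return hsic(cause, resid)

def orient_anm(x, y):
    """Prefer the direction whose residuals look more independent of the input."""
    return "X -> Y" if anm_dependence(x, y) < anm_dependence(y, x) else "Y -> X"

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x ** 3 + x + rng.uniform(-1, 1, size=500)  # ground truth: X -> Y
print(orient_anm(x, y))
```

In the forward direction the residuals are just the uniform noise, independent of X; in the backward direction no additive noise model fits, so the residuals remain dependent on Y and the forward direction wins.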
Information about gold standards (GS) used in the study.
| Task | Reference/source | # TFs in GS | # genes in GS | # gene probes for GS genes in gene expression data | # TF-gene interactions | # TF-gene interactions significant at FDR = 0.05 |
|---|---|---|---|---|---|---|
| ECOLI | [ ] | 140 | 913 | 913 | 1,885 | |
| YEAST | [ ] | 115 | 1,834 | 1,834 | 3,541 | |
| NOTCH1 | [ ] | 1 | 302 | 813 | 813 | |
| RELA | [ ] | 1 | 1,420 | 3,657 | 3,657 | |
"TF" stands for "transcription factor". Statistically significant associations were determined using Fisher's Z-test at 5% FDR in microarray gene expression data (please see text for details).
Information about microarray gene expression datasets used in the study for each gold standard.
| Task name | Reference/source | # samples |
|---|---|---|
| ECOLI | [ ] | 907 |
| YEAST | [ ] | 530 |
| NOTCH1 | [ ] | 174 |
| RELA | [ ] | 174 |
Only T-ALL samples were selected for the NOTCH1 and RELA tasks in order to match the cell population used for the creation of the respective gold standards.
An example demonstrating the construction of the response variable for AUC computation
| a) | | | | b) | | |
|---|---|---|---|---|---|---|
| NOTCH1 | → | ABCF2 | | NOTCH1 | → | ABCF2 |
| NOTCH1 | → | EIF4E | | EIF4E | ← | NOTCH1 |
| NOTCH1 | → | SFRS3 | | NOTCH1 | → | SFRS3 |
| NOTCH1 | → | NUP98 | | NUP98 | ← | NOTCH1 |
| NOTCH1 | → | CYCS | | NOTCH1 | → | CYCS |
| NOTCH1 | → | ZNHIT | | ZNHIT | ← | NOTCH1 |
| NOTCH1 | → | ATM | | NOTCH1 | → | ATM |
| NOTCH1 | → | TIMM9 | | TIMM9 | ← | NOTCH1 |
A fragment of the gold standard is shown in a). The edges always point from a transcription factor (NOTCH1) to its target gene. 50% of the edges are represented as "transcription factor → gene" and the other 50% as "gene ← transcription factor" in b). This constructs a response variable with positives corresponding to "→" edges (shown in black) and negatives corresponding to "←" edges (shown in red).
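The construction above reduces orientation to a ranking problem that AUC can score. The sketch below simulates this under stated assumptions: the 50/50 label split mimics the edge flipping, and the per-pair "orientation scores" of a hypothetical method are drawn as the true label plus Gaussian noise; none of this is the paper's actual data.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_edges = 200

# Response variable: 1 for edges kept as "TF -> gene", 0 for edges
# re-represented as "gene <- TF" (a random 50/50 split, as in the text).
labels = np.zeros(n_edges, dtype=int)
labels[rng.choice(n_edges, n_edges // 2, replace=False)] = 1

# Hypothetical orientation scores: higher = the method is more confident
# the pair should be oriented "TF -> gene". Informative but noisy.
scores = labels + rng.normal(scale=0.8, size=n_edges)

print(round(roc_auc_score(labels, scores), 2))
```

A perfect orienter scores AUC 1.0, a random one about 0.5, which is why 0.50 is the reference line in the results tables below.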
Figure 1. a) AUC is computed using real data; b) AUC is computed using random data from the Normal distribution (null distribution) for the same gold standard as used with the real data, and this step is repeated 1,000 times; c) a p-value is calculated by comparing AUCs from the null distribution to the AUC obtained in the real data.
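The permutation scheme of Figure 1 can be sketched as follows. One simplification to note: rather than rerunning an orientation method on 1,000 Normal-distributed datasets as in the paper, this toy draws the null scores directly from a Normal distribution; the data are simulated.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
scores = labels + rng.normal(scale=0.8, size=200)  # an informative method
auc_real = roc_auc_score(labels, scores)

# Null distribution: AUCs of random Gaussian scores against the same labels,
# repeated 1,000 times (step b of Figure 1).
null_aucs = [roc_auc_score(labels, rng.normal(size=200)) for _ in range(1000)]

# Empirical p-value with add-one smoothing (step c of Figure 1).
p_value = (1 + sum(a >= auc_real for a in null_aucs)) / (1 + len(null_aucs))
print(p_value)
```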
Accuracy of causal orientation methods for each gold standard
| Method | ECOLI | YEAST | NOTCH1 | RELA |
|---|---|---|---|---|
| ANM | 0.462 | 0.383 | 0.476 | 0.396 |
| PNL | 0.453 | 0.471 | ||
| IGCI (Uniform/Entropy) | 0.427 | |||
| IGCI (Uniform/Integral) | 0.441 | |||
| IGCI (Gaussian/Entropy) | ||||
| IGCI (Gaussian/Integral) | ||||
| GPI-MML | 0.485 | 0.390 | 0.251 | 0.395 |
| ANM-MML | 0.428 | 0.316 | 0.183 | 0.172 |
| GPI | 0.401 | |||
| ANM-GAUSS | 0.480 | 0.483 | 0.462 | |
| LINGAM | 0.469 | 0.451 | 0.367 | 0.387 |
| RANDOM | 0.500 | 0.500 | 0.500 | 0.500 |
For each gold standard (column) dark orange cells correspond to methods that have high values of accuracy, while white cells correspond to methods that have low values of accuracy. Accuracies higher than 0.50 are shown in bold.
Ranks of causal orientation methods for each gold standard (by accuracy)
| Method | ECOLI | YEAST | NOTCH1 | RELA |
|---|---|---|---|---|
| ANM | - | - | - | - |
| PNL | - | - | ||
| IGCI (Uniform/Entropy) | - | |||
| IGCI (Uniform/Integral) | - | |||
| IGCI (Gaussian/Entropy) | ||||
| IGCI (Gaussian/Integral) | ||||
| GPI-MML | - | - | - | - |
| ANM-MML | - | - | - | - |
| GPI | - | |||
| ANM-GAUSS | - | - | - | |
| LINGAM | - | - | - | - |
Ranks were computed only for the methods with accuracies greater than 0.50. The lower the rank, the better the accuracy of the causal orientation method for the given gold standard. The computation of rank took into account statistical variability, e.g. accuracies 0.647 and 0.645 obtained by the two IGCI methods in the ECOLI gold standard are statistically indistinguishable; that is why they have the same rank value.
AUC of causal orientation methods for each gold standard
| Method | ECOLI | YEAST | NOTCH1 | RELA |
|---|---|---|---|---|
| ANM | 0.464 | 0.379 | 0.456 | 0.369 |
| PNL | 0.443 | 0.464 | ||
| IGCI (Uniform/Entropy) | 0.409 | |||
| IGCI (Uniform/Integral) | 0.437 | |||
| IGCI (Gaussian/Entropy) | ||||
| IGCI (Gaussian/Integral) | ||||
| GPI-MML | 0.488 | 0.370 | 0.184 | 0.333 |
| ANM-MML | 0.393 | 0.237 | 0.078 | 0.071 |
| GPI | 0.396 | |||
| ANM-GAUSS | 0.474 | 0.476 | 0.446 | |
| LINGAM | 0.462 | 0.463 | 0.362 | 0.392 |
| RANDOM | 0.500 | 0.500 | 0.500 | 0.500 |
For each gold standard (column) dark orange cells correspond to methods that have high values of AUC, while white cells correspond to methods that have low values of AUC. AUCs higher than 0.50 are shown in bold.
Ranks of causal orientation methods for each gold standard (by AUC)
| Method | ECOLI | YEAST | NOTCH1 | RELA |
|---|---|---|---|---|
| ANM | - | - | - | - |
| PNL | - | - | ||
| IGCI (Uniform/Entropy) | - | |||
| IGCI (Uniform/Integral) | - | |||
| IGCI (Gaussian/Entropy) | ||||
| IGCI (Gaussian/Integral) | ||||
| GPI-MML | - | - | - | - |
| ANM-MML | - | - | - | - |
| GPI | - | |||
| ANM-GAUSS | - | - | - | |
| LINGAM | - | - | - | - |
Ranks were computed only for the methods with AUCs greater than 0.50. The lower the rank, the better the AUC of the causal orientation method for the given gold standard. The computation of rank took into account statistical variability, e.g. the AUCs of 0.724 and 0.713 obtained by the two IGCI methods in the ECOLI gold standard are statistically indistinguishable; that is why they have the same rank value.
Figure 2. Error bars denote 80% intervals of variation that were empirically estimated in 100 datasets for each value of the noise proportion.
Figure 3. Error bars denote 80% intervals of variation that were empirically estimated in 100 datasets for each value of the noise proportion.
Figure 4. Error bars denote 80% intervals of variation that were empirically estimated in 20 datasets for each value of the noise proportion.
Figure 5. Error bars denote 80% intervals of variation that were empirically estimated in 20 datasets for each value of the noise proportion.
Figure 6. Error bars denote 80% intervals of variation that were empirically estimated in 100 sampled datasets of each sample size.
Figure 7. Error bars denote 80% intervals of variation that were empirically estimated in 100 sampled datasets of each sample size.
Figure 8. Error bars denote 80% intervals of variation that were empirically estimated in 20 sampled datasets of each sample size.
Figure 9. Error bars denote 80% intervals of variation that were empirically estimated in 100 sampled datasets of each sample size.
Ensemble causal orientation results and comparison with the best performing individual causal orientation methods
| ECOLI | YEAST | NOTCH1 | RELA | |
|---|---|---|---|---|
| Best individual causal orientation method (AUC) | 0.828 | 0.658 | 0.926 | 0.970 |
| Ensemble method (AUC) | 0.837 | 0.822 | 0.984 | 0.992 |
| Improvement (AUC) | 0.009 | 0.164 | 0.058 | 0.022 |
| Statistical significance of improvement (p-value) | 0.3407 | | | |
Bold p-values indicate a statistically significant performance improvement by using an ensemble causal orientation. The p-values were obtained from DeLong's test for AUC comparison [45].
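DeLong's test has a closed-form variance estimate; as a simpler stand-in (explicitly not the test used in the paper), a paired bootstrap comparison of two AUCs conveys the same idea. All data below are simulated, and the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_pvalue(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Two-sided paired-bootstrap p-value for H0: AUC(scores_a) == AUC(scores_b).
    Resamples cases with replacement and compares AUCs on each resample."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, n)
        if labels[idx].min() == labels[idx].max():
            continue  # a resample must contain both classes
        diffs.append(roc_auc_score(labels[idx], scores_a[idx])
                     - roc_auc_score(labels[idx], scores_b[idx]))
    diffs = np.asarray(diffs)
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return min(p, 1.0)

rng = np.random.default_rng(2)
labels = rng.integers(0, 2, size=300)
ensemble = labels + rng.normal(scale=0.5, size=300)  # stronger method
single = labels + rng.normal(scale=2.0, size=300)    # weaker method
print(bootstrap_auc_pvalue(labels, ensemble, single) < 0.05)  # True
```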
Coefficients for the ensemble logistic regression model trained in the YEAST gold standard
| Method (feature in the logistic regression model) | Beta | P-value |
|---|---|---|
| ANM | -1.20 | 0.291 |
| PNL | -0.27 | 0.750 |
| IGCI (Uniform/Entropy) | ||
| IGCI (Uniform/Integral) | ||
| IGCI (Gaussian/Entropy) | ||
| IGCI (Gaussian/Integral) | ||
| GPI-MML | 1.15 | 0.578 |
| ANM-MML | ||
| GPI | 1.45 | 0.298 |
| ANM-GAUSS | 0.40 | 0.808 |
| LINGAM | 0.11 | 0.963 |
Bold values correspond to coefficients that are statistically significant at 0.05 alpha level. We note that due to multicollinearity among the IGCI Uniform methods and among the IGCI Gaussian methods, care must be taken when interpreting the logistic regression coefficients [36].
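The ensemble idea, stacking the base methods' orientation scores with logistic regression, can be sketched as below. The three synthetic "base methods" with different noise levels are illustrative stand-ins for ANM, IGCI, LINGAM, etc., and the train/test split mimics training in one gold standard and applying the model elsewhere; none of this reproduces the paper's data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 400

# Response: 1 = "TF -> gene" representation, 0 = flipped (as in the AUC setup).
y = rng.integers(0, 2, size=n)

# Hypothetical orientation scores from three base methods with different
# noise levels (feature columns of the logistic regression model).
X = np.column_stack([y + rng.normal(scale=s, size=n) for s in (1.0, 1.5, 2.0)])

# Train the ensemble on one set of pairs, evaluate on held-out pairs
# (the paper trains in YEAST and applies to the other gold standards).
clf = LogisticRegression().fit(X[:300], y[:300])
ens_auc = roc_auc_score(y[300:], clf.predict_proba(X[300:])[:, 1])
print(round(ens_auc, 2))
```

Because the base scores carry partly complementary noise, the weighted combination typically ranks the pairs better than any single column, which is the pattern the ensemble row of the table above reflects.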