Holly F Ainsworth, So-Youn Shin, Heather J Cordell.
Abstract
Genome-wide association studies (GWAS) have been very successful over the last decade at identifying genetic variants associated with disease phenotypes. However, interpretation of the results obtained can be challenging. Incorporation of further relevant biological measurements (e.g. 'omics' data) measured in the same individuals for whom we have genotype and phenotype data may help us to learn more about the mechanism and pathways through which causal genetic variants affect disease. We review various methods for causal inference that can be used for assessing the relationships between genetic variables, other biological measures, and phenotypic outcome, and present a simulation study assessing the performance of the methods under different conditions. In general, the methods we considered did well at inferring the causal structure for data simulated under simple scenarios. However, the presence of an unknown and unmeasured common environmental effect could lead to spurious inferences, with the methods we considered displaying varying degrees of robustness to this confounder. The use of causal inference techniques to integrate omics and GWAS data has the potential to improve biological understanding of the pathways leading to disease. Our study demonstrates the suitability of various methods for performing causal inference under several biologically plausible scenarios.
Keywords: Bayesian networks; Mendelian randomisation; causal inference; structural equation modelling
Year: 2017 PMID: 28691305 PMCID: PMC5655748 DOI: 10.1002/gepi.22061
Source DB: PubMed Journal: Genet Epidemiol ISSN: 0741-0395 Impact factor: 2.135
Figure 1. Possible causal models explaining the relationship between a genetic variant G and two observed traits X and Y. Models (h)–(l) include an unmeasured common environmental effect E.
Details of simulation models for scenarios given in Figure 1
[Table: one row per scenario (a)–(l), giving the simulation model for that scenario; the model equations are not preserved in this record.]
The default parameter values are α = 1, β = 1, δ = 1, = 10, = 10, γ = 1, ζ = 1, = 0.3, = 0.3, = 0.3. G is coded as (0, 1, 2) according to the number of minor alleles present at the SNP.
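As an illustration of the genotype coding and default effect sizes, the following is a minimal sketch of simulating one data set under a simple chain G → X → Y. It assumes a linear model with Gaussian noise, which may differ from the simulation equations used in the study. The function name `simulate_chain`, the minor allele frequency `maf`, and the noise standard deviation `noise_sd` are hypothetical choices made for this example; only α = 1, β = 1 and the (0, 1, 2) coding come from the text.

```python
import numpy as np


def simulate_chain(n=1000, maf=0.3, alpha=1.0, beta=1.0, noise_sd=1.0, seed=0):
    """Simulate G -> X -> Y under an ASSUMED linear-Gaussian model.

    maf and noise_sd are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    # G counts minor alleles (0, 1 or 2); drawing it as Binomial(2, maf)
    # assumes Hardy-Weinberg equilibrium at the SNP.
    g = rng.binomial(2, maf, size=n)
    # X depends on G with effect size alpha (assumed linear form).
    x = alpha * g + rng.normal(0.0, noise_sd, size=n)
    # Y depends on X only, with effect size beta (assumed linear form).
    y = beta * x + rng.normal(0.0, noise_sd, size=n)
    return g, x, y


g, x, y = simulate_chain()
print(sorted(set(g.tolist())))          # genotype codes present in the sample
print(round(float(np.corrcoef(g, y)[0, 1]), 2))  # G and Y correlate via X
```

Under this sketch, any association between G and Y arises only through X, which is the structure the causal inference methods are asked to recover.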
Figure 2. Results of applying MR and the CIT to simulated data sets. The x‐axis represents the scenario from which the data were simulated. The y‐axis represents the proportion of replicates in which a causal model was detected (X → Y for MR, and G → X → Y with X the only link between G and Y, for the CIT). Black and grey represent true and false detections, respectively. For MR, we considered detections from simulated data sets with an arrow X → Y as true detections. For the CIT, we considered detections from simulated data sets with arrows G → X → Y but no additional link between G and Y as true detections.
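The kind of detection proportion plotted on the y‐axis can be sketched as follows. This is not the MR implementation used in the study, which is not described in this record. It is a simple ratio-estimator MR with a first-order standard error and a normal approximation for the p-value, applied to data from an assumed linear-Gaussian chain. All function names, the sample size, the number of replicates, and the allele frequency are hypothetical choices.

```python
import math

import numpy as np


def simulate(n, beta, rng, maf=0.3, alpha=1.0):
    """ASSUMED linear-Gaussian chain G -> X -> Y; beta = 0 removes X -> Y."""
    g = rng.binomial(2, maf, size=n)
    x = alpha * g + rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    return g, x, y


def slope_and_se(u, v):
    """Least-squares slope of v on u and its standard error."""
    u = u - u.mean()
    v = v - v.mean()
    suu = np.sum(u * u)
    b = np.sum(u * v) / suu
    resid = v - b * u
    se = math.sqrt(np.sum(resid ** 2) / (len(u) - 2) / suu)
    return b, se


def mr_detects(g, x, y, level=0.05):
    """Ratio-estimator MR: True if the X -> Y estimate differs from zero."""
    b_gx, _ = slope_and_se(g, x)
    b_gy, se_gy = slope_and_se(g, y)
    ratio = b_gy / b_gx                    # estimated effect of X on Y
    se_ratio = se_gy / abs(b_gx)           # first-order approximation
    z = ratio / se_ratio
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return p < level


def detection_rate(beta, replicates=200, n=300, seed=1):
    """Proportion of simulated replicates in which MR detects X -> Y."""
    rng = np.random.default_rng(seed)
    hits = sum(mr_detects(*simulate(n, beta, rng)) for _ in range(replicates))
    return hits / replicates


print(detection_rate(beta=1.0))  # X -> Y present: detections are true
print(detection_rate(beta=0.0))  # X -> Y absent: detections are false
```

With a true X → Y effect the rate corresponds to the black (true detection) bars, and with the effect removed it corresponds to the grey (false detection) bars.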
Results from performing causal inference on simulated data sets
| Simulation model | |||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Tested model | a | b | c | d | E | f | g | h | i | j | k | l |
| SEM | a |
| 689 | 688 | 689 | 689 | 686 | 685 |
| 1,381 | 1,380 | 282 | 283 |
| b | 504 |
| 402 | 503 |
| 399 | 1,088 |
|
| 553 | 148 | 841 | |
| c | 504 | 400 |
|
| 505 | 1,091 | 398 |
| 552 |
| 840 | 148 | |
| f | 1,090 | 685 | 1,091 | 1,600 | 1,090 |
| 684 | 688 | 282 | 687 |
| 686 | |
| g | 1,089 | 1,091 | 684 | 1,092 | 1,601 | 686 |
| 687 | 686 | 282 | 687 |
| |
| BUF | a |
| 120.93 | 120.7 | 155.97 | 156.18 | 120.84 | 121.06 |
|
|
| 98.80 | 98.78 |
| b | 120.78 |
| 83.68 | 120.94 |
| 83.58 | −0.08 | 83.61 |
| 37.89 | 83.76 | −0.08 | |
| c | 120.95 | 83.64 |
|
| 121.13 | −0.09 | 83.84 | 83.69 | 37.81 |
| −0.08 | 83.71 | |
| m | 35.06 | 37.28 | −0.03 | −0.01 | 35.05 |
| 37.22 | 15.1 | 61.03 | 15.05 |
| 15.06 | |
| n | 35.23 | −0.02 | 37.03 | 35.03 | −0.01 | 37.25 |
| 15.18 | 15.07 | 61.06 | 15.04 |
| |
| DEAL | a |
| −1,359 | −1,360 | −1,378 | −1,379 | −1,343 | −1,343 |
| −2,245 | −2,244 | −1,689 | −1,689 |
| b | −1,254 |
| −1,200 | −1,264 |
| −1,263 | −1,530 | −1,196 |
| −1,821 | −1,620 | −1,954 | |
| c | −1,254 | −1,199 |
|
| −1,263 | −1,530 | −1,196 | −1,618 | −1,821 |
| 1,954 | −1,619 | |
| d |
| −1,011 | −1,012 |
| −1,025 | −1,010 | −1,010 |
|
|
| −1,553 | −1,554 | |
| e |
| −1,011 | −1,012 | −1,025 |
| −1,010 | −1,010 |
|
|
| −1,553 | −1,554 | |
| f | −1,541 | −1,339 | −1,537 | −1,794 | −1,550 |
| −1,339 | −1,880 | −1,693 | −1,889 |
| −1,884 | |
| g | −1,541 | −1,536 | −1,341 | −1,549 | −1,793 | 1,338 |
| −1,880 | −1,888 | −1,693 | −1,883 |
| |
| m | −1,544 | −1,688 | −1,886 | −2,148 | −1,904 | −1,338 | −1,673 | −2,027 | −2,377 | −2,573 |
| −2,019 | |
| n | −1,544 | −1,884 | −1,690 | −1,902 | −2,147 | −1,671 | −1,338 | −2,027 | −2,573 | −2,377 | −2,019 |
| |
| BNLEARN | a |
| −1,322 | −1,323 | −1,323 | −1,321 | −1,322 | −1,320 |
| −2,214 | −2,215 | −1,667 | −1,668 |
| b | −1,230 |
| −1,178 | −1,229 |
| −1,176 | −1,516 | −1,601 |
| −1,799 | −1,598 | 1,945 | |
| c | −1,231 | −1,176 |
|
| −1,228 | −1,522 | −1,173 | −1,602 | −1,799 |
| −1,944 | −1,599 | |
| d | −985 | −984 | −985 |
| −984 | −984 | −982 |
|
|
| −1,531 | −1,533 | |
| e | −9,85 | −984 | −985 | −984 |
| −984 | −982 |
|
|
| −1,531 | −1,533 | |
| f |
| −1,325 | −1,531 | −1,784 | −1,528 |
| −1,320 | −1,876 | −1,669 | −1,871 |
| −1,874 | |
| g | −1,532 | −1,528 | −1,328 | −1,529 | −1,782 | −1,325 |
| −1,878 | −1,872 | −1,669 | −1,873 |
| |
| m | −1,521 | −1,663 | −1,869 | −2,123 | −1,864 | −1,317 | −1,658 | −2,012 | −2,353 | −2,555 |
| −2,009 | |
| n | −1,523 | −1,866 | −1,665 | −1,867 | −2,119 | −1,663 | −1,315 | −2,013 | −2,556 | −2,353 | −2,009 |
| |
Cells represent the average (over 1,000 replicates) of the scores describing how well each model fits the data. Columns represent data simulated under the 12 different scenarios and rows describe which model is being tested. Each of the four methods uses a different score to assess model fit. For SEM, low numeric scores indicate better fit. For the other three methods, higher numeric scores indicate better fit. Average score(s) that indicate the preferred model out of those tested are underlined. Cells with bold indicate the correct model choice.
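The selection rule described in this note (lowest score preferred for SEM, highest score preferred for the other three methods) can be sketched as below. The helper `preferred_models` and the example scores are hypothetical and are not taken from the table; ties return every tied model, matching the note's "score(s)".

```python
# Direction of "better" for each method, as stated in the table note.
LOWER_IS_BETTER = {"SEM": True, "BUF": False, "DEAL": False, "BNLEARN": False}


def preferred_models(method, scores):
    """Return the tested model(s) whose average score indicates the best fit.

    scores maps a tested-model label to its average score for one simulated
    scenario (one column of the table).
    """
    pick = min if LOWER_IS_BETTER[method] else max
    best = pick(scores.values())
    return sorted(model for model, score in scores.items() if score == best)


# Hypothetical average scores for a single simulated scenario.
print(preferred_models("SEM", {"a": 700.0, "b": 400.0, "c": 1090.0}))
print(preferred_models("BUF", {"a": 120.9, "b": 83.7, "c": 120.9}))
```

Comparing the returned label(s) with the scenario that generated the data gives the correct or incorrect model choice for that column.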
Figure 3. Results showing the effect of changing the effect size (ζ) of the common environmental effect E on inference. The x‐axis shows the value of ζ used and the y‐axis shows the proportion of replicates in which the correct causal scenario was identified for data simulated under model (h) (left panel) and model (i) (right panel).
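To see why a larger ζ can make inference harder, the sketch below simulates a hypothetical structure in which an unmeasured E influences both X and Y while X has no causal effect on Y. This structure and its linear-Gaussian form are assumptions for illustration and are not necessarily scenario (h) or (i). The function name `confounded_corr`, the sample size, and the allele frequency are also hypothetical.

```python
import numpy as np


def confounded_corr(zeta, n=5000, maf=0.3, alpha=1.0, seed=0):
    """Correlation between X and Y induced purely by an unmeasured E.

    ASSUMED model: X = alpha*G + zeta*E + noise, Y = zeta*E + noise,
    so there is no arrow between X and Y in either direction.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, maf, size=n)       # genotype coded 0, 1, 2
    e = rng.normal(size=n)                 # unmeasured common environment
    x = alpha * g + zeta * e + rng.normal(size=n)
    y = zeta * e + rng.normal(size=n)      # no G or X term in Y
    return float(np.corrcoef(x, y)[0, 1])


for zeta in (0.0, 1.0, 2.0):
    print(zeta, round(confounded_corr(zeta), 2))
```

In this sketch the X–Y correlation grows with ζ even though neither trait causes the other, which is the kind of spurious signal a method must be robust to.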
Figure 4. Results showing the effect of changing the effect size (α) on inference. The x‐axis shows the value of α used in the simulation model and the y‐axis shows the proportion of replicates in which the correct causal scenario was identified for data simulated under model (a) (left panel) and model (b) (right panel).