Chengwei Su, Angeline Andrew, Margaret R Karagas, Mark E Borsuk.
Abstract
We review the applicability of Bayesian networks (BNs) for discovering relations between genes, environment, and disease. By translating probabilistic dependencies among variables into graphical models and vice versa, BNs provide a comprehensible and modular framework for representing complex systems. We first describe the Bayesian network approach and its applicability to understanding the genetic and environmental basis of disease. We then describe a variety of algorithms for learning the structure of a network from observational data. Because of their relevance to real-world applications, the topics of missing data and causal interpretation are emphasized. The BN approach is then exemplified through application to data from a population-based study of bladder cancer in New Hampshire, USA. For didactic purposes, we intentionally keep this example simple. When applied to complete data records, we find only minor differences in the performance and results of different algorithms. Subsequent incorporation of partial records through application of the EM algorithm gives us greater power to detect relations. Allowing for network structures that depart from a strict causal interpretation also enhances our ability to discover complex associations including gene-gene (epistasis) and gene-environment interactions. While BNs are already powerful tools for the genetic dissection of disease and generation of prognostic models, there remain some conceptual and computational challenges. These include the proper handling of continuous variables and unmeasured factors, the explicit incorporation of prior knowledge, and the evaluation and communication of the robustness of substantive conclusions to alternative assumptions and data manifestations.
Year: 2013 PMID: 23514120 PMCID: PMC3614442 DOI: 10.1186/1756-0381-6-6
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. A simple BN representing the relationship between cancer incidence (C), environmental exposure (E), a biomarker (B) and three single nucleotide polymorphisms (S1, S2, S3). See text for further description.
First five observations of dataset with some missing values represented by
| 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | |
| 3 | 1 | 0 | 0 |
| 4 | 1 | 0 | |
| 5 | 0 | 0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ |
First five observations of probabilistically completed dataset
| 1 | 1 | 1 | 1 | 0 |
| 2 | 0 | 0.359 | 1 | 1 |
| | 1 | 0.641 | | |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 0 | 0.417 | 1 | 0 |
| | 1 | 0.583 | | |
| 5 | 0 | 1 | 0 | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
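The probabilistic completion tabulated above can be read as one expectation step of EM: a record with a missing binary value is split into one row per possible value, and each row is weighted by the conditional probability of that value given the observed entries. The following is a minimal sketch of that idea, not the authors' implementation. The variable names `x`, `y`, `z`, the helpers `complete_record` and `expected_count`, and the callback `p_one` are hypothetical; in practice the probability would be computed from the current network parameters. The value 0.641 simply mirrors the weight listed for observation 2.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, Optional[int]]          # None marks a missing value
WeightedRow = Tuple[Dict[str, int], float]


def complete_record(
    record: Record,
    p_one: Callable[[str, Record], float],
) -> List[WeightedRow]:
    """Split a record with at most one missing binary value into weighted rows."""
    missing = [name for name, value in record.items() if value is None]
    if not missing:
        return [(dict(record), 1.0)]       # a complete record keeps weight 1
    if len(missing) > 1:
        raise ValueError("this sketch handles one missing value per record")
    var = missing[0]
    p1 = p_one(var, record)                # P(var = 1 | observed values)
    return [({**record, var: 0}, 1.0 - p1), ({**record, var: 1}, p1)]


def expected_count(rows: List[WeightedRow], var: str, value: int) -> float:
    """Sum the weights of all completed rows in which var takes the given value."""
    return sum(weight for row, weight in rows if row[var] == value)


# Hypothetical data: one complete record and one record with x missing.
data: List[Record] = [
    {"x": 1, "y": 1, "z": 0},
    {"x": None, "y": 1, "z": 1},
]

# Assumed conditional probability, fixed here for illustration only.
completed = [
    row
    for record in data
    for row in complete_record(record, lambda var, observed: 0.641)
]
```

The expected counts returned by `expected_count(completed, "x", 1)` would then feed the maximization step, which re-estimates the conditional probability tables before the next completion.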
Figure 2. Comparison of the BNs learned by four different algorithms. Edges that differ between the two networks are indicated in grey in the bottom graph.
Comparison of different algorithms
| Tests | 259 | 366 | 750 | 921 |
| Directed arcs | 10 | 8 | ||
| Log-likelihood | -4045.78 | |||
| AIC | -4073.78 | |||
| log(K2) | -4125.07 | |||
| BIC | -4137.16 | |||
The highest value of each score is in bold.
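The relation between the tabulated scores can be checked with a little arithmetic. The sketch below assumes, without confirmation from the table itself, that the scores are reported on the log-likelihood scale (higher is better), so that AIC = LL - k and BIC = LL - (k/2) ln n, where k is the number of free parameters and n the number of records. Under that assumption, the gap between the log-likelihood and the AIC gives k, and the gap to the BIC then gives an implied n. The function name `implied_penalties` is hypothetical.

```python
import math
from typing import Tuple


def implied_penalties(loglik: float, aic: float, bic: float) -> Tuple[float, float]:
    """Recover k and n, assuming AIC = LL - k and BIC = LL - (k / 2) * ln(n)."""
    k = loglik - aic                          # number of free parameters
    n = math.exp((loglik - bic) / (k / 2.0))  # implied number of records
    return k, n


# Values copied from the comparison table above.
k, n = implied_penalties(loglik=-4045.78, aic=-4073.78, bic=-4137.16)
print(f"implied parameters k = {k:.2f}, implied sample size n = {n:.1f}")
```

Because the scores are rounded to two decimals, the implied n is only approximate; the point of the check is that the three scores differ solely through their complexity penalties.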
Figure 3. A non-causal network. Edges that differ relative to the causal networks are indicated in grey. Shaded nodes indicate the Markov blanket of CANCER.
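For a directed network, the Markov blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below computes it from an edge list. The edges are invented for illustration and do not reproduce the structure in any of the figures; the node labels merely borrow the abbreviations of Figure 1, and the function name `markov_blanket` is hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def markov_blanket(edges: Iterable[Edge], target: str) -> Set[str]:
    """Return parents, children, and co-parents of target in a DAG."""
    edge_list = list(edges)
    parents = {u for u, v in edge_list if v == target}
    children = {v for u, v in edge_list if u == target}
    # Co-parents ("spouses"): other parents of the target's children.
    spouses = {u for u, v in edge_list if v in children and u != target}
    return parents | children | spouses


# Hypothetical DAG, not taken from the paper.
toy_edges = [
    ("S1", "C"),   # parent of C
    ("E", "C"),    # parent of C
    ("C", "B"),    # child of C
    ("S2", "B"),   # co-parent of C through B
    ("S3", "S1"),  # ancestor of C, outside the blanket
]

blanket = markov_blanket(toy_edges, "C")
```

In this toy graph `S3` is excluded: conditional on the blanket, it carries no further information about `C`.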
Figure 4. Candidate structures (represented by dotted edges) in which TOENAIL_AS would be included in the Markov blanket of CANCER.
Expected scores for the five candidate structures shown in Figure 4 after applying the EM algorithm
| none | -3824.1 | -3837.1 | -3861.7 | |
| TOENAIL_AS → CANCER | -3824.6 | -3839.6 | -3865.7 | -3877.2 |
| CANCER → TOENAIL_AS | -3824.6 | -3838.6 | -3865.3 | -3873.7 |
| TOENAIL_AS → SMOKER | -3820.9 | -3837.9 | -3864.6 | -3880.5 |
| TOENAIL_AS → XRCC3_241 | | | | -3875.4 |
The highest scoring network according to each score is in bold.
Figure 5. Final structure with directed edges in which no causal interpretation is implied (left) and equivalent undirected Markov network (right). Grey edges are new relative to earlier structures.
Prognostic bladder cancer risk for some of the 32 possible combinations of risk factors
| 1 | female | no | low | variant | variant | 0.68 | 13 |
| 2 | female | no | low | wildtype | wildtype | 1.0 (ref) | 22 |
| 3 | female | no | low | variant | wildtype | 1.10 | 16 |
| 4 | female | no | low | wildtype | variant | 1.45 | 39 |
| 5 | female | no | high | variant | wildtype | 2.02 | 2 |
| 6 | female | yes | low | wildtype | wildtype | 2.22 | 27 |
| 7 | male | no | low | wildtype | wildtype | 2.23 | 20 |
| 8 | male | yes | high | variant | variant | 2.25 | 5 |
| 9 | female | yes | low | variant | wildtype | 2.45 | 24 |
| 10 | male | yes | low | variant | wildtype | 4.36 | 181 |
| 11 | female | yes | high | variant | wildtype | 4.48 | 9 |
| 12 | male | yes | high | wildtype | variant | 5.17 | 14 |
| 13 | male | yes | low | wildtype | variant | 5.75 | 95 |
| 14 | male | yes | high | variant | wildtype | 7.99 | 14 |
Logistic regression results for associations discovered in final BN model
| Term | Coefficient | Std. error | p-value | Odds ratio | 95% CI |
| Intercept | -1.382 | 0.170 | <0.00001 | 1 | - |
| S | 0.640 | 0.154 | 0.00003 | 1.90 | (1.40, 2.56) |
| G | 0.633 | 0.143 | <0.00001 | 1.88 | (1.42, 2.49) |
| X4 | 0.403 | 0.179 | 0.024 | 1.50 | (1.05, 2.12) |
| A:X241 | 0.435 | 0.295 | 0.140 | 1.55 | (0.87, 2.76) |
| X241:X4 | -0.731 | 0.240 | 0.002 | 0.48 | (0.30, 0.77) |
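The odds ratios and intervals in the table above follow from the coefficients in the usual way, assuming the first two numeric columns are the coefficient estimate and its standard error: the odds ratio is exp(coefficient), and a Wald 95% interval is exp(coefficient ± z × standard error) with z ≈ 1.96. The sketch below reproduces the row for S; the function name `odds_ratio_ci` is hypothetical, and small discrepancies in other rows are to be expected because the tabulated coefficients are rounded.

```python
import math
from statistics import NormalDist
from typing import Tuple


def odds_ratio_ci(coef: float, se: float, level: float = 0.95) -> Tuple[float, float, float]:
    """Return the odds ratio and its Wald confidence limits for a logit coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)


# Coefficient and standard error copied from the row for S.
or_s, lower_s, upper_s = odds_ratio_ci(coef=0.640, se=0.154)
print(f"OR = {or_s:.2f}, 95% CI = ({lower_s:.2f}, {upper_s:.2f})")
```

An interval that excludes 1, as it does here, corresponds to a coefficient that differs from zero at the chosen level.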