Harold Bae, Stefano Monti, Monty Montano, Martin H Steinberg, Thomas T Perls, Paola Sebastiani.
Abstract
Bayesian networks are probabilistic models that represent complex distributions in a modular way and have become very popular in many fields. There are many methods to build Bayesian networks from a random sample of independent and identically distributed observations. However, many observational studies use some form of clustered sampling that introduces correlations between observations within the same cluster, and ignoring this correlation typically inflates the rate of false positive associations. We describe a novel parameterization of Bayesian networks that uses random effects to model the correlation within sample units and can be used for structure and parameter learning from correlated data without inflating the Type I error rate. We compare different learning metrics using simulations and illustrate the method in two real examples: an analysis of genetic and non-genetic factors associated with human longevity from a family-based study, and an analysis of risk factors for complications of sickle cell anemia from a longitudinal study with repeated measures.
Year: 2016 PMID: 27146517 PMCID: PMC4857179 DOI: 10.1038/srep25156
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Example of Ignoring Within-Cluster Correlations When Learning BN.
2,000 simulated data sets were generated using the network structure shown on the left and assuming normal distributions for the 5 variables. In 1,000 sets, the observations were IID, and in the remaining 1,000 sets data were generated from 581 independent clusters, with observations correlated within clusters. The table summarizes the number of times the true network was selected in 1,000 simulations with IID observations and 1,000 simulations with correlated data, the false positive rates, and family-wise error rates using three common model selection metrics and a forward search. False positive rates were defined as the number of additional or missing edges over the total number of tests, and family-wise error rates were defined as the probability of one or more errors in the overall search. BIC: Bayesian Information Criterion; AIC: Akaike Information Criterion; LRT: Likelihood Ratio Test at α = 0.05.
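The inflation described in this caption can be reproduced in a much smaller setting. The sketch below is not the authors' simulation: it assumes a single candidate edge, a covariate shared by all members of a cluster, a random cluster intercept, and a naive least-squares test at the 1.96 cutoff. The function name `type1_error_rate` and all of its parameters are hypothetical.

```python
import numpy as np

def type1_error_rate(n_clusters, cluster_size, icc, n_sims=500, seed=0):
    """Share of simulations in which a null covariate is declared significant
    by a naive OLS slope test that treats every row as independent."""
    rng = np.random.default_rng(seed)
    n = n_clusters * cluster_size
    cluster = np.repeat(np.arange(n_clusters), cluster_size)
    rejections = 0
    for _ in range(n_sims):
        # Covariate shared within each cluster and unrelated to the outcome.
        x = rng.normal(size=n_clusters)[cluster]
        # Outcome = cluster random intercept + independent noise (total variance 1).
        alpha = rng.normal(scale=np.sqrt(icc), size=n_clusters)[cluster]
        y = alpha + rng.normal(scale=np.sqrt(1.0 - icc), size=n)
        # Naive OLS slope and its standard error.
        xc, yc = x - x.mean(), y - y.mean()
        sxx = xc @ xc
        slope = (xc @ yc) / sxx
        resid = yc - slope * xc
        se = np.sqrt((resid @ resid) / (n - 2) / sxx)
        if abs(slope / se) > 1.96:
            rejections += 1
    return rejections / n_sims

rate_iid = type1_error_rate(100, 5, icc=0.0)        # no within-cluster correlation
rate_clustered = type1_error_rate(100, 5, icc=0.5)  # strong within-cluster correlation
print(rate_iid, rate_clustered)
```

With no within-cluster correlation the rejection rate should stay near the nominal 5%, while the clustered setting should reject the null far more often, which is the qualitative pattern the figure reports.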
Figure 2. Example of BN with 3 observable variables (Y1, Y2, Y3) and parameter vectors θ = (θ1, θ2, θ3).
If there are no missing data, the observations are independent, and the prior distribution of the parameters follows a hyper-Markov law, then the marginal likelihood p(D|M) factorizes into a product of 3 local marginal likelihood functions.
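Written out, the factorization takes the following general form. The notation is an assumption made for illustration: n is the number of observations, y_{ik} is the k-th observation of Y_i, and pa(y_i)_k is the corresponding value of the parents of Y_i under model M (empty for a root node).

```latex
p(D \mid M) \;=\; \prod_{i=1}^{3} \int \prod_{k=1}^{n}
    p\bigl(y_{ik} \mid \mathrm{pa}(y_i)_k,\ \theta_i\bigr)\, p(\theta_i)\, d\theta_i
```

Each factor depends only on one node, its parents, and its own parameter vector θ_i, which is what allows a search procedure to score local changes to the network one node at a time.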
Figure 3. An Example Pedigree and Corresponding Additive Genetic Relationship Matrix.
The kinship matrix contains the pairwise kinship coefficients between family members; each coefficient represents the probability that two individuals share the same allele identical by descent. The covariance between two family members with kinship coefficient k is 2kγ², where γ² represents the genetic variance.
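As a small worked example of the 2kγ² rule, the sketch below builds the genetic covariance matrix for a hypothetical nuclear family (two unrelated, non-inbred parents and two children). The pedigree and the value of `gamma2` are illustrative assumptions and are not taken from the figure.

```python
import numpy as np

# Kinship coefficients for a hypothetical family: father, mother, child 1, child 2.
# Self-kinship of a non-inbred individual is 0.5; parent-child and full-sibling
# pairs have kinship 0.25; the unrelated parents have kinship 0.
kinship = np.array([
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
])

gamma2 = 1.5  # illustrative genetic variance

# Covariance between relatives: 2 * k * gamma^2 for every pair.
genetic_cov = 2.0 * gamma2 * kinship
print(genetic_cov)
```

The diagonal equals γ² (since 2 × 0.5 = 1), parent-child and sibling pairs get covariance γ²/2, and unrelated individuals get zero.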
Figure 4. Left panel: common parameterization of a simple directed graphical model with 3 observable, Gaussian variables (Y1, Y2, Y3), conditional on the parameter vector θ. Nodes in orange are the parameters that define the conditional parent-children distribution of the observable variables (fixed effects), while the nodes in yellow are nuisance parameters. Right panel: our proposed parameterization when both the dependency structure and conditional probability distributions need to be estimated from correlated data. The random effects α (blue nodes) have probability distributions that depend on parameters γ (lavender nodes). Both parameters γ and random effects α are used to model the correlation between observations as in Equation (4).
False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Normally Distributed Data When h² = 0.50.
| Score | Sample Size | No. False Positive Covariates: Level 1 | Level 2 | Level 3 | Level 4 | Level ≥5 | Tot Test | FPR | FWER |
|---|---|---|---|---|---|---|---|---|---|
| | 4656 | 46 | 0 | 0 | 0 | 0 | 10405 | 0.0044 | 0.045 |
| | 2796 | 59 | 1 | 0 | 0 | 0 | 10530 | 0.0057 | 0.058 |
| | 1768 | 74 | 2 | 0 | 0 | 0 | 10647 | 0.0071 | 0.071 |
| | 582 | 128 | 7 | 1 | 0 | 0 | 11135 | 0.0122 | 0.120 |
| | 4656 | 1616 | 779 | 270 | 71 | 17 | 23310 | 0.1181 | 0.836 |
| | 4656 | 513 | 100 | 13 | 2 | 0 | 14535 | 0.0432 | 0.415 |
| | 4656 | 128 | 7 | 0 | 0 | 0 | 11136 | 0.0121 | 0.120 |
| | 4656 | 965 | 311 | 79 | 8 | 0 | 18218 | 0.0748 | 0.642 |
| | 4656 | 2316 | 1381 | 685 | 277 | 110 | 28479 | 0.1674 | 0.931 |
Levels indicate the hierarchy in the forward search procedure such that Level 1 indicates the search is performed on all 10 covariates, Level 2 indicates that the search is performed on 9 covariates given that at least one false positive covariate was selected in the previous level, and so forth. BIC: BIC based on integrated likelihood and full sample size; BIC, BIC, BIC: BIC with Jones’, Young and conservative effective sample size; AIC: AIC based on integrated likelihood and full sample size; LRT: likelihood ratio test based on integrated likelihood to account for correlated data; BIC, LRT and AIC: traditional BIC, likelihood ratio test, and AIC. FPR is the false positive rate defined as number of errors over total number of tests ignoring correlated data; FWER is family wise error rate, i.e., probability of one or more errors.
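The FPR definition in this legend (number of errors over total number of tests) can be checked against the table directly. The sketch below copies three rows of counts from the table above; the helper name `false_positive_rate` is hypothetical.

```python
def false_positive_rate(level_counts, total_tests):
    """Errors summed over all search levels, divided by the total number of tests."""
    return sum(level_counts) / total_tests

# (false positives at levels 1, 2, 3, 4, >=5), total tests, FPR reported in the table
rows = [
    ((46, 0, 0, 0, 0), 10405, 0.0044),
    ((128, 7, 1, 0, 0), 11135, 0.0122),
    ((1616, 779, 270, 71, 17), 23310, 0.1181),
]

for counts, total, reported in rows:
    computed = false_positive_rate(counts, total)
    print(f"computed={computed:.4f} reported={reported:.4f}")
```

For these rows the computed value matches the reported FPR to four decimals. The FWER column is not reconstructed here because the legend does not give enough detail to derive it from the counts.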
Power Comparisons of Four Variants of BIC vs. Corresponding LRT (Normally Distributed Data).
| Score | FPR | Power: Strong Effect | Power: Moderate Effect | Power: Weak Effect |
|---|---|---|---|---|
| | 0.0044 | 0.572 | 0.295 | 0.139 |
| | 0.0044 | 0.593 | 0.314 | 0.151 |
| | 0.0057 | 0.608 | 0.322 | 0.162 |
| | 0.0057 | 0.623 | 0.340 | 0.172 |
| | 0.0071 | 0.635 | 0.346 | 0.178 |
| | 0.0071 | 0.648 | 0.362 | 0.186 |
| | 0.0122 | 0.708 | 0.426 | 0.245 |
| | 0.0122 | 0.710 | 0.429 | 0.247 |
Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BIC: BIC based on integrated likelihood and full sample size; BIC, BIC, BIC: BIC with Jones’, Young and conservative effective sample size; , , , and : likelihood ratio test based on integrated likelihood using the significance threshold obtained from empirical false positive rates of BIC, BIC, BIC and BIC. For example, since BIC has an observed false positive rate of 0.0044, we compared the power of the BIC to the power of the LRT with significance threshold of 0.0044.
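One way to see why a BIC comparison behaves like a likelihood ratio test with a small significance threshold is the standard textbook relationship for nested models that differ by one parameter: BIC prefers the larger model when twice the log-likelihood gain exceeds log(n), and that gain is asymptotically chi-square with 1 degree of freedom under the null. The sketch below computes the implied threshold. It uses the ordinary BIC penalty, not the integrated-likelihood scores of the paper, and it reads the values 4656, 2796, 1768 and 582 from the first simulation table as the full and effective sample sizes, which is an assumption. The function name `bic_implied_alpha` is hypothetical.

```python
import math

def bic_implied_alpha(sample_size):
    """P(chi-square with 1 df > log(n)): the significance level at which an LRT
    for one extra parameter makes the same decision as the standard BIC."""
    threshold = math.log(sample_size)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(threshold / 2.0))

for n in (4656, 2796, 1768, 582):
    print(n, round(bic_implied_alpha(n), 4))
```

Smaller effective sample sizes give a weaker penalty and therefore a larger implied threshold, which is the same ordering as the empirical false positive rates listed in the power table.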
False Positive Rates and Family-wise Error Rates of Different Model Selection Metrics For Time-to-event Data When h² = 0.50.
| Score | No. False Positive Covariates: Level 1 | Level 2 | Level 3 | Level 4 | Level ≥5 | Tot Test | FPR | FWER |
|---|---|---|---|---|---|---|---|---|
| | 71 | 1 | 0 | 0 | 0 | 10638 | 0.0068 | 0.070 |
| | 1654 | 831 | 327 | 81 | 21 | 23553 | 0.1237 | 0.822 |
| | 561 | 120 | 14 | 3 | 0 | 14850 | 0.0470 | 0.436 |
| | 121 | 11 | 0 | 0 | 0 | 11069 | 0.0119 | 0.109 |
| | 2057 | 1180 | 530 | 188 | 58 | 26572 | 0.1510 | 0.884 |
| | 767 | 226 | 46 | 5 | 0 | 16604 | 0.0629 | 0.543 |
BIC: BIC based on integrated likelihood and number of events as the sample size; AIC: AIC based on integrated likelihood and full sample size; LRT: likelihood ratio test based on integrated likelihood to account for correlated data; BIC, LRT and AIC: traditional BIC, likelihood ratio test, and AIC. FPR is the false positive rate defined as number of errors over total number of tests ignoring correlated data; FWER is family wise error rate, i.e., probability of one or more errors.
Power Comparisons of BIC vs. Corresponding LRT For Time-to-event Data.
| | | FPR | Power: Strong Effect | Power: Moderate Effect | Power: Weak Effect |
|---|---|---|---|---|---|
| | | 0.0073 | 0.961 | 0.726 | 0.490 |
| | | 0.0073 | 0.964 | 0.741 | 0.502 |
| | | 0.0068 | 0.830 | 0.516 | 0.315 |
| | | 0.0068 | 0.841 | 0.522 | 0.323 |
| | | 0.0078 | 0.513 | 0.255 | 0.144 |
| | | 0.0078 | 0.540 | 0.285 | 0.161 |
Results are based on 1,000 simulated datasets with 3 situations of strong, moderate, and weak covariate effects. BIC: BIC based on integrated likelihood and number of events as the sample size; : likelihood ratio test based on integrated likelihood using the significance threshold obtained from empirical false positive rates.
Summary of 23 Genes in the IIS Pathway.
| Gene | Chromosome | Number of Tested SNPs |
|---|---|---|
| | 14 | 78 |
| | 19 | 142 |
| | 1 | 793 |
| | 13 | 275 |
| | 6 | 110 |
| | 1 | 674 |
| | 5 | 506 |
| | 12 | 994 |
| | 15 | 854 |
| | 8 | 76 |
| | 11 | 75 |
| | 19 | 859 |
| | 2 | 2105 |
| | 13 | 1637 |
| | 16 | 9 |
| | 3 | 318 |
| | 3 | 391 |
| | 1 | 172 |
| | 7 | 883 |
| | 5 | 4179 |
| | 19 | 32 |
| | 1 | 207 |
| | 17 | 295 |
Figure 5. Top 3 BNs built using the proposed parameterization that dissect the associations of SNPs in genes of the IIS pathway through effects on blood biomarkers.
The different edges among the three networks are colored in red.
Markov Blanket of Each Node in the Top 3 BNs.
| Node | MB in | MB in | MB in |
|---|---|---|---|
| FUS | TR, Age.E, Hgb, Sex, rs1009375 | TR, Age.E, Hgb, Sex, rs1009375 | TR, Age.E, Hgb, Sex, rs1009375 |
| Age.E | BYC, Sex, rs6974881, FUS, TR, Hgb, rs1009375 | BYC, Sex, rs6974881, FUS, TR, Hgb, rs1009375 | BYC, Sex, rs6974881, FUS, TR, Hgb, rs1009375 |
| DHEA | Hgb, IGF-1, BYC, TR | Hgb, IGF-1, BYC, TR | Hgb, IGF-1, BYC, TR |
| TR | Hgb, BYC, DHEA, IGF-1, FUS, Age.E, Sex, rs1009375 | Hgb, BYC, DHEA, IGF-1, FUS, Age.E, Sex, rs1009375 | Hgb, BYC, DHEA, IGF-1, FUS, Age.E, Sex, rs1009375 |
| IGF-1 | Hgb, TR, BYC, DHEA | Hgb, TR, BYC, DHEA | Hgb, TR, BYC, DHEA |
| Hgb | BYC, IGF-1, FUS, TR, DHEA, Age.E, Sex, rs1009375 | BYC, IGF-1, FUS, TR, DHEA, Age.E, Sex, rs1009375 | BYC, IGF-1, FUS, TR, DHEA, Age.E, Sex, rs1009375 |
FUS: Follow-up Survival; Age.E: Age at enrollment; DHEA: Dehydroepiandrosterone; TR: Transferrin Receptors; IGF-1: Insulin-like growth factor 1; INS: Insulin; Hgb: Hemoglobin.
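For readers unfamiliar with the term, the Markov blanket of a node in a directed acyclic graph consists of its parents, its children, and the other parents of its children. The sketch below computes it for a toy graph; the graph `toy_dag` and the function `markov_blanket` are hypothetical and do not correspond to the networks in the table.

```python
def markov_blanket(parents, node):
    """Markov blanket of `node` in a DAG given as {node: set of parent nodes}."""
    # Children are the nodes that list `node` among their parents.
    children = {child for child, ps in parents.items() if node in ps}
    # Co-parents ("spouses") are the other parents of those children.
    co_parents = {p for child in children for p in parents[child]}
    return (set(parents.get(node, set())) | children | co_parents) - {node}

# Toy DAG: A -> C <- B, C -> D
toy_dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}

print(sorted(markov_blanket(toy_dag, "A")))  # child C and co-parent B
print(sorted(markov_blanket(toy_dag, "C")))  # parents A, B and child D
```

Conditional on its Markov blanket, a node is independent of every other node in the network, which is why the table lists the blanket of each node rather than the full edge set.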
Figure 6. Top 3 BNs built ignoring the familial correlations in the data used in Fig. 5.
The different edges among the three networks are colored in red. Compared to the BNs in Fig. 5, two additional SNPs rs17224116 and rs10048024 are added to the models.
Figure 7. Left Panel: Top BN using the proposed approach and associated Markov Blanket of each node. Right Panel: Top BN built ignoring correlations due to the repeated measurements on the same subjects and associated Markov Blanket of each node. Additional variables in the Markov Blanket as a result of ignoring correlations are colored red. Hg: hemoglobin; SGOT: serum glutamic oxaloacetic transaminase; DBP: diastolic blood pressure; Retic: reticulocyte count; Platelet: platelet count; RBC: red blood cells; WBC: white blood cells; HbF: fetal hemoglobin; MCV: mean corpuscular volume.