Literature DB >> 25631319

Bayesian network reconstruction using systems genetics data: comparison of MCMC methods.

Shinya Tasaki¹, Ben Sauerwine², Bruce Hoff³, Hiroyoshi Toyoshiba⁴, Chris Gaiteri⁵, Elias Chaibub Neto⁶.

Abstract

Reconstructing biological networks using high-throughput technologies has the potential to produce condition-specific interactomes. But are these reconstructed networks a reliable source of biological interactions? Do some network inference methods offer dramatically improved performance on certain types of networks? To facilitate the use of network inference methods in systems biology, we report a large-scale simulation study comparing the ability of Markov chain Monte Carlo (MCMC) samplers to reverse engineer Bayesian networks. The MCMC samplers we investigated included foundational and state-of-the-art Metropolis-Hastings and Gibbs sampling approaches, as well as novel samplers we have designed. To enable a comprehensive comparison, we simulated gene expression and genetics data from known network structures under a range of biologically plausible scenarios. We examine the overall quality of network inference via different methods, as well as how their performance is affected by network characteristics. Our simulations reveal that network size, edge density, and strength of gene-to-gene signaling are major parameters that differentiate the performance of various samplers. Specifically, more recent samplers including our novel methods outperform traditional samplers for highly interconnected large networks with strong gene-to-gene signaling. Our newly developed samplers show comparable or superior performance to the top existing methods. Moreover, this performance gain is strongest in networks with biologically oriented topology, which indicates that our novel samplers are suitable for inferring biological networks. The performance of MCMC samplers in this simulation framework can guide the choice of methods for network reconstruction using systems genetics data.

Entities: Chemical Disease Gene Species

Keywords: Bayesian networks; MCMC methods; causal inference; eSNPs; network reconstruction

Mesh：

Year: 2015 PMID： 25631319 PMCID： PMC4391572 DOI： 10.1534/genetics.114.172619

Source DB: PubMed Journal: Genetics ISSN： 0016-6731 Impact factor: 4.562

COMPLEX diseases such as Alzheimer’s disease and type 2 diabetes are influenced by intricate gene-to-gene and gene-by-environment interactions (Peila ; Huang ; Liu ; Rhinn ). The goal of gene network reverse engineering is to learn the gene-to-gene interaction architecture underlying such diseases. Some algorithms output networks that correspond to putative causal relationships among genes, by combining SNP and expression data. These inferred networks can be used to generate hypotheses that can be validated experimentally (Schadt ; Chen ; Aten ; Ferrara ; Chaibub Neto , 2013; Duarte and Zeng 2011). In particular, Bayesian approaches to inferring causal networks are becoming a common practice in the field of systems biology (Zhu , 2008; Chaibub Neto ; Hageman ; Moon ) and have successfully generated novel insights into biological processes (Zhang ). Nonetheless, inferring the structure of Bayesian networks remains a challenging statistical and computational problem. Specifically, the relative performance of different causal network reconstruction algorithms is unclear when they are applied to real biological data sets. Therefore, to enable more accurate network reconstructions, we conduct a systematic comparison of the performance of several Markov chain Monte Carlo (MCMC) samplers, as well as novel sampling methods. Our conclusions facilitate the recovery of correct networks under a range of biologically realistic scenarios. We report the results of a large-scale simulation study comparing the relative performance of state-of-the-art MCMC samplers (Grzegorczyk and Husmeier 2008; Goudie and Mukherjee 2011) with novel sampler variations developed by us and with the standard Metropolis–Hastings structure samplers (Madigan ; Giudici and Castelo 2003). Efforts to compare MCMC samplers have focused on data generated from a handful of benchmark networks such as the ALARM network (Beinlinch ). Such limited test sets can lead to conclusions about the performance of various methods that are not necessarily robust. Therefore, to reach high-confidence conclusions, we evaluated the merits of various network reconstruction algorithms over a wide range of biologically relevant networks. For instance, envision that researchers develop sampler A that, in theory, is expected to outperform an alternative sampler, B, on sparse networks. To empirically check this hypothesis, the researchers may restrict their attention to a single benchmark network. Even if multiple networks are generated to test this hypothesis, they likely come from a specific parameter set, such as 200 data samples generated from a sparse network consisting of 40 continuous variables, via a set of linear structural equations with moderate values for the regression coefficients and residual variances. Suppose further that, as expected, sampler A does outperform sampler B across most of the simulations. Although the researchers might be inclined to claim that sampler A is better than B in sparse networks, we argue that the researchers cannot really support this statement because the effect of sparsity on the sampler’s performance is confounded with the effects of the other simulation parameters, namely, the sample size, the number of nodes, and the amounts of signal and noise. That is, it is not possible to determine whether sampler A outperforms sampler B, because the benchmark network is sparse or because, in reality, sampler A is better than B for the particular choice of simulation parameters adopted in the simulation. All the researchers can claim is that sampler A performs better than B for the particular values of the simulation parameters adopted in the study, but there are no guarantees that sampler A would outperform sampler B, had the researchers chosen a different set of simulation parameter values. Therefore, to perform a rigorous comparison of the performance of different MCMC samplers and to investigate the conditions under which one sampler performs better than another, we designed a multifactorial simulation study with crossed factors. In our simulation study, network parameters played the role of factors, and the difference in area under the response curve of different samplers played the role of the response variable (Figure 1). By comparing the performance of methods across many different situations, the differences in performance we observe are less likely to be due to a specific parameter setting. Moreover, we can track how different types of network topology influence the performance of all methods or a specific method. This enables biologists to understand how data sets with different origins and features are likely to affect the accuracy of estimated networks and which methods are optimized for the characteristics of a particular biological data set.

Figure 1

Workflow of simulation study. Systems genetics data composed of gene expression and genotyping data were generated based on a regulatory network consisting of genes and SNPs. Bayesian networks were estimated by applying various structure MCMC samplers to these systems genetics data, generated by a known network structure. The composite networks obtained by Bayesian model averaging were then compared with a true network structure. AUCPR and AUCROC are employed as a performance measure. Overall performance of different MCMC samplers and identification of the network features that affect the accuracy of network reconstruction were investigated through ANOVA of the AUCPR or AUCROC results. In our simulations, we investigated the effects of eight distinct simulation parameters, namely sample size, network topology, the number of network nodes, incorporation of genetic information, average edge density, average gene-to-gene signal, average SNP-to-gene signal, and intrinsic expression noise. (See Methods for further details.) These simulation variables were chosen to represent aspects of biological data sets that are reasonably expected to vary in a wide range of future applications. For instance, the strength of gene–gene correlations is controlled by a variety of biophysical processes such as transcription factor binding, chromosome configuration, and epigenetics (Gaiteri ). We compared the performance of four published MCMC samplers and five novel sampler variations that we developed. The published samplers included (i) the foundational Metropolis–Hastings structure sampler (“STR sampler”) (Madigan ; Giudici and Castelo 2003), (ii) the “REV” Metropolis–Hastings sampler (Grzegorczyk and Husmeier 2008), (iii) the single-parent set block Gibbs sampler (“1PB”), and (iv) the two-parent sets block Gibbs sampler (“2PB”) (Goudie and Mukherjee 2011). The new samplers included (v) the three-parent sets block Gibbs sampler (“3PB”), (vi) the four-parent sets block Gibbs sampler (“4PB”), (vii) the connected two-parent sets block sampler (“c2PB”), (viii) the connected three-parent sets block sampler (“c3PB”), and (ix) the connected four-parent sets block sampler (“c4PB”). Descriptions and comments on each of these samplers are provided in the Appendix. Since our multifactorial design involves a large number of distinct factor-level combinations (1458 in total), and for each combination we run nine distinct MCMC samplers, we considered only a single replication per simulation parameter combination. Each sampler was run for a fixed time with either a single longer chain or multiple shorter chains. Collectively, the experiment included 52,488 instances of reverse-engineering gene regulatory networks. To the best of our knowledge, this is the largest simulation study comparing MCMC samplers for structure learning in Bayesian networks.

Methods

Bayesian networks

Background and technical material on Bayesian networks and the MCMC samplers investigated in this study are presented in the Appendix.

Setting of the MCMC runs

Each sampler was run for a fixed time with either a single longer chain or multiple shorter chains, starting from different random initial structures. For 30, 65, and 100 node networks, we executed MCMC samplers in 30, 300, and 2100 sec, respectively, on an Opteron 2.2-GHz core. For single-chain running, we collected at most 5000 networks uniformly from a chain and 20% of initial networks were discarded as burn-in. When running multiple chains, the chain length was controlled so that on average 50 chains completed in the time allotted. The last network of a chain was collected from each chain.

Program implementation

All the MCMC samplers were implemented in MATLAB 2012a (MathWorks). To conduct fair comparisons, the routines shared across the MCMC samplers were implemented in the same way by using the same subfunctions. Moreover, the routines specific for each MCMC sampler were highly optimized. All the programs were compiled by MATLAB Compiler to enable batch execution. The code to run all methods is available for download (DOI: 10.7303/syn2910187).

Simulation parameters

The selection and range of simulation parameters were designed to reflect characteristics of typical systems genetics experiments. To investigate the relative performance of the different structure samplers, we designed a multifactorial simulation experiment with crossed factors, focusing on the effects of seven distinct simulation parameters, namely the following:In total, our experiment was conducted based on 2 × 3 × 3 × 3 × 3 × 3 × 3 = 1458 distinct networks. Network topology t, with levels “random” and “EIPO” (exponential in in-degree and power law on out-degree, as proposed by Guelzim ). Number of network nodes, p, with levels “30,” “65,” and “100.” Edge density, d, with levels “low,” “medium,” and “high.” The edge density of a network is defined as the number of edges divided by the number of possible edges, namely, . For each simulation, d was set to 0.02, 0.04, and 0.06 as low, medium, and high edge density, respectively. We did not limit the maximum number of parent nodes in simulated networks, even though sampling methods often limit the number of possible parents, for reasons of computational efficiency. Average gene-to-gene signal, η, defined as the average absolute value of the nonzero coefficients (of the regression of a gene on another). For each simulation, η was sampled uniformly from the ranges [0, 0.33], [0.33, 0.66], [0.66, 1] corresponding to low, medium, and high, respectively. Intrinsic expression noise, , with levels low, medium, and high, and defined as the variance of the error term, , used in the simulation of the expression values (via linear structural equations). For each simulation, was sampled uniformly from the ranges [0, 1], [1, 2], [2, 3] corresponding to low, medium, and high, respectively. Average SNP-to-gene signal, γ, defined as the average absolute value of the nonzero coefficients (of the regression of an expression phenotype on a SNP). For each simulation, γ was sampled uniformly from the ranges [0, 1], [1, 2], [2, 3] corresponding to low, medium, and high, respectively. Sample size, n, with levels 100, 200, 300. This choice reflects typical sample sizes observed in the literature for causal gene networks (Zhang ).

Simulation of network structures

We generated network structure based on EIPO topology observed in transcriptional regulatory networks (Guelzim ). Random topology networks were also generated as reference. We did not limit the maximum number of parent nodes for the simulated networks. Network structures were generated using the SysGenSIM software (Pinna ). This software allows the user to control the number of nodes and the network average degree (=), as well as choose among several topology types, including the random network topology and the EIPO topology. Although the software can also simulate genotype and expression data from experimental crosses (where the latter are generated according to nonlinear ordinary differential equations), we do not employ those features for data generation. Instead, we generate genotype data from outbred populations and simulate expression data from a multivariate normal distribution consistent with the structural equation model representing the network structure.

Simulation of SNP data

To incorporate realistic genetics data in our simulations, we generate SNP data matrices by randomly selecting chunks of real SNP data from the HapMap3 database. To do this, we first randomly choose p genes from the refGene in the UCSC Genome Browser and then select all cis-SNPs associated with the p genes from genotype data on a Caucasian population. We define a cis-SNP as any SNP physically located between (−)110 kb upstream of the transcription start site and (+)40 kb downstream of the transcription end site, because this region is thought to contain 99% of cis-eSNPs (e: expression single nucleotide polymorphism) (Veyrieras ).

Simulation of network data

Given the SNP data and a (possibly cyclic) directed network structure, G, with continuous (gene expression) and discrete (SNP) nodes, we generated multivariate normal gene expression data with a covariance structure implied by the structural equation model (SEM) describing G. Because the SNP nodes are necessarily exogenous variables in G, the associated SEM can be represented, in matrix notation, as (Liu ), where (i) ϵ is a matrix of independent and identically distributed residual error terms, (ii) X is a matrix of expression levels, and (iii) Q is a matrix of SNP genotype codes. B is a matrix of gene-to-gene causal effects, where the element represents the partial regression coefficient of the regression of gene i on gene j. The structure of matrix B corresponds to the directed graph representing the interactions between the genes with edges corresponding of nonzero elements of B. Because we adopt independent residual error terms, matrix B can be rearranged into a lower triangular matrix when G is acyclic, but will be necessarily nontriangular for cyclic graphs. C is a matrix of SNP-to-gene causal effects. Each entry () of C represents the partial regression coefficient of the regression of gene i on SNP j. The SNP-to-gene edges in G are represented by the nonzero elements of C. Given the matrices B, C, Q, and ϵ, we generate the expression data matrix as . Note that because each column of ϵ follows a distribution, we have that the conditional distribution of each column of X given B, C, and Q is . Furthermore, no diagonal dominance enforcement of was necessary to ensure matrix invertibility since the simulated networks were small; the diagonal and most of the off-diagonal entries of B were zero (due to the absence of self loops and relatively low connectivity of the simulated networks), and the nonzero entries of B consisted of low values. The values of the nonzero partial regression coefficients in B and C are sampled randomly from a uniform distribution and then rescaled so that the average of their absolute values is η and γ, respectively. The range of η, γ, and ρ is determined so that the variance of the expression phenotypes explained by eSNPs in simulated data covers that observed in real data, which ranges from 0.1 to 0.3 (Grundberg ; McKenzie ; Yang ). In particular, the median variance explained by SNPs of our simulated data was 0.11 with a median absolute deviation of 0.15. An additional constraint to C was that 20% of the expression phenotypes have SNPs, which was based on the empirical proportion of genes with eSNPs, ranging from 5% to 35% (Brown ). In addition, each such expression phenotype can have at most two SNPs, since three independent eSNPs for a single gene were very rare in studies with moderate sample size (Stranger ). Consequently, 80% of the rows of C are completely 0, and the remaining 20% of the rows can have at most two nonzero entries.

eSNP mapping

We perform eSNP mapping tailored to the network structure by evaluating conditional independence relations between expression traits and SNPs, given their expression trait parents (Chaibub Neto ). The likelihood ratio between the full model and the null model that is generated by removing the SNPs term from the equation can be used as a formal test of conditional independence. The empirical null distribution of the likelihood ratio is estimated by 1000 permutations of individual labels as follows. At each time, the individual labels of either expression trait or SNPs were randomly permuted so that relations between expression trait and SNPs were broken while keeping the correlation structure among expression traits intact (and while preserving the correlation structure of the SNPs as well). Then the likelihood ratios of all the pairwise relationships between expression traits and their cis-SNPs were calculated, followed by the collection of the maximum value across all computed likelihood ratios. The permutation null distribution generated by this procedure is then used to test the null hypothesis of detecting an association given that none of the SNPs are associated with the expression trait. The genome-wide error rate of detecting cis-eSNP associations was controlled at 0.05.

Performance measures

We evaluated network reconstruction performance, using the area under the curve of the precision-recall plot (AUCPR), whereand a true positive was defined as an inferred directed edge that existed in the true network used to generate the expression data. In some cases, the area under the curve of the receiver operating characteristic (AUCROC) was used for performance evaluation. AUCROC is a composite measure based on true positive rate (recall) and false positive rate, defined aswhere a false positive was defined as an inferred directed edge that did not exist in the true network used to generate the expression data.

Results

Network characteristics affect the performance of Bayesian network reconstruction

To elucidate effects of various biological characteristics on network estimation accuracy, we performed a systematic simulation experiment with an ensemble of systems genetics data where each parameter (network characteristic) was sampled over a biologically plausible range. To evaluate the accuracy of an estimated network, we applied Bayesian model averaging to the probabilistic network estimates produced by any given method, to obtain a composite adjacency matrix. Then, AUCPR as adopted as a measure of correctness of an output adjacency matrix, given a true network. Using this response variable, we used ANOVA to identify network features that have significant impact on the performance of network reconstruction. All the network characteristics tested, as well as various 2 × 2 interactions among these, were found to be significantly associated with AUCPR (Figure 2), but the strength of the effects and form of the interactions varied widely. Our experiment indicates that network topology affects the accuracy of network reconstruction. Networks with EIPO topology were reconstructed more accurately than those with random topology (Figure 2A). The experiment also identified harmful and beneficial effects of network characteristics on estimation accuracy. The increase in the number of genes, edge density, and noise level decreased AUCPR monotonically. Conversely, the increase of sample size and strength of SNP-to-gene signal positively contributed to the accuracy of reverse engineering (Figure 2A). Unlike the other network characteristics, gene-to-gene signal strength showed a nonmonotonic effect on AUCPR (Figure 2A). Initially, the increase in gene-to-gene signal strength had a beneficial effect on performance because stronger signals aided in distinguishing true regulations from noise. However, once the gene-to-gene signal strength exceeded a certain point, it showed the opposite effect (Figure 2A). Strength of gene-to-gene signal also showed significant interactions with other characteristics, including sample size, the number of genes, edge density, and network topology (Figure 2B). However, these interactions did not alter the nonmonotonic trend of the effect of gene-to-gene signal on AUCPR (Supporting Information, Figure S1).

Figure 2

Effect of network features on the performance of Bayesian network reconstruction. (A) Marginal performance plots for each one of the seven simulation parameters. The dashed horizontal line represents the mean of the AUCPR distribution. (B) The heatmap represents significance for the interactions between any two pairs of the seven simulation parameters. We found that secondary effects of gene-to-gene signaling accounted for the nonmonotonic effect of gene-to-gene signal on AUCPR. We observed that high average gene-to-gene signal strength monotonically increased the correlations between directly connected genes and also those between indirectly connected genes (Figure 3A). These indirect effects of gene-to-gene signal strength were quantified by M = log10(directCor/indirectCor) based on the average correlations of directly connected genes and those of indirectly connected genes. This M measure showed a nonmonotonic trend along with the amount of gene-to-gene signal strength (Figure 3A) and was well correlated with AUCPR (Figure 3B). This suggests that the poor network reconstruction performance against networks with high gene-to-gene signal is due to the low M of the data set through the increase of indirect signals.

Figure 3

Tracking both direct and indirect gene-to-gene correlations explains how moderate correlations lead to the most accurate network inference. (A) Increasing the gene-to-gene signal strength increases the average correlation between genes, as expected. The strength of indirect correlations between genes also increases as gene-to-gene signal strength increases. Direct and indirect correlations are the average correlation of directly connected genes and indirectly connected genes, respectively. The M measure represents the log10 ratio between direct correlation and indirect correlation in each data set. (B) The bell-shaped response of AUCPR to increasing direct gene-to-gene correlations might be unexpected, but once the concurrent increase of indirect correlations is taken into account (through the M measure), the expected relationship between increasing signal and increased network accuracy is found (bottom right).

Comparison of structural samplers

Next we investigated general performances of structural samplers. For rigorous quantification, both AUCPR and AUCROC were employed as performance measures. Based on these measures, c2PB, c3PB, and REV outperformed other samplers, whereas STR, a traditional sampler, showed the worst performance (Figure 4A). The main methodological difference between top-performing samplers c2PB, c3PB, and REV and STR is the magnitude of modifications to a network offered at each step. Specifically, REV and c2PB update parents of two nodes and c3PB updates parents of three nodes simultaneously at each step. By contrast, STR modifies only a single edge of a network at each step. Superior performance of c2PB, c3PB, and REV was quite robust over many types of networks tested (Figure S2), suggesting samplers that execute drastic network modifications at a single step can potentially generate more accurate network structures within a set time compared to methods that employ smaller updates.

Figure 4

Performance comparison of network structure samplers. (A) Marginal performance plots for the nine MCMC samplers. (B) Significance of interaction between MCMC samplers and the nine simulation parameters. Asterisks indicate the variables that are significant at P-value <10E-16. (C) The interaction plots for the nine MCMC samplers and two simulation parameters: sample size and MCMC chain type. The performance of the samplers interacted with various network and simulation parameters (Figure 4B). We found that sample size and MCMC chain type had practically important contributions to the performance of the samplers. The performance gain offered by larger sample size was not observed uniformly: a limited set of samplers, namely c2PB, c3PB, and REV, benefits most from larger sample size (Figure 4C). Note that these samplers are all able to execute drastic network modifications in a single step and also showed superior performance even for small sample sizes. Samplers such as STR and 1PB showed only slight or no performance gains with the increase of sample size (Figure 4C). This observation suggests method selection is crucial in benefiting from larger sample sizes. MCMC chain type also showed a significant interaction with structural samplers. We employed two types of MCMC chains: a single longer chain and multiple shorter chains. Our results indicated that STR, 2PB, 3PB, 4PB, c3PB, and c4PB performed better with a longer single chain, whereas 1PB, c2PB, and REV performed better with shorter multiple chains (Figure 4C). The differences between these two types of samplers could be generated by differences in the time required to reach a stationary distribution. Specifically, in the case of samplers with slow convergence such as STR, 4PB, and c4PB, each short chain was terminated before reaching the stationary distribution. Therefore, the average Bayesian information criterion (BIC) of multiple chains is higher than that of a single chain (Figure S3), which resulted in lower AUCPRs. Conversely, the samplers including 1PB, c2PB, and REV that reach a stationary distribution faster would have potential benefits from exploring optimal networks over the large network space by employing shorter multiple chains from different initial networks. We note, however, that the minimum BIC of each sampler did not depend on the convergence speed (Figure S3). This suggests that some samplers with fast convergence speed such as 1PB can reach suboptimal networks quickly, but are not able to exit from the suboptimal configuration state. To classify MCMC samplers based on their performance, we used hierarchical clustering analysis of AUCPR and AUCROC results (Figure 5). In this analysis, we treated a single longer chain and multiple shorter chains as separate methods. The clustering analysis based on the AUCPR result suggested MCMC samplers can be clustered into four types, which were driven by differences in performance on four primary types of data sets (Figure 5). Examination of network characteristics associated with the four groups of data sets revealed the number of genes, edge density, and gene-to-gene signal strength are major characteristics that differentiate the performance of various samplers (Figure S4A). In AUCPR-based clustering, 1PB, 2PB, and 3PB running with multiple shorter chains were in the same group and showed strong performance, especially for networks with high numbers of genes, high edge density, and high gene-to-gene signal strength. The top-performing samplers, c2PB, c3PB, and REV, were also clustered together and worked well across many simulations, except for networks with low gene-to-gene signals. STR and 4PB fell into the same cluster, showing poor performances over most network results. In the AUCROC-based clustering, we also observed four clusters of MCMC samplers and four groups of datasets (Figure 5). The three network characteristics, numbers of genes, edge density, and gene-to-gene signal strength were again associated with the four groups of data sets (Figure S4B). This further supports the large impact of these three network parameters on the accuracy of the reverse engineering of gene-regulatory networks. Similar to the AUCPR result above, 1PB, 2PB, and 3PB and c2PB, c3PB, and REV clustered into the same group (Figure 5). However, the strongest factor contributing to the cluster separation was the chain type employed. Running with multiple chains resulted in clearly higher AUCROC than with single-chain execution.

Figure 5

Cluster analysis for MCMC samplers. Top panels contain the heatmaps of the normalized AUCPR and AUCROC and associated dendrograms obtained through hierarchical clustering of the results. The rows indicate MCMC samplers running either single chain or multiple chains. The columns indicate simulated networks. Bottom panels represent the number of genes, edge density, and gene-to-gene signal averaged across networks clustered together in the dendrograms.

Detailed comparison of top-performing samplers

Next we investigated the differences among the top-performing samplers, REV, c2PB, and c3PB. To evaluate relative performance of any pair of structure samplers, A and B, we used single-replicate multiway ANOVA, with the response given by W = AUCPR – AUCPR. We focus here on the AUCPR results from a single-chain condition to reveal differences in characteristics across samplers. Two parameters, gene-to-gene signal strength and network topology, had significant effects (P < 1e-16) on the performance of c3PB relative to REV (Figure 6). As the strength of gene-to-gene signaling increased, the c3PB performance improved considerably over that of REV. Also c3PB performed significantly better than REV for the networks with EIPO topology compared to networks with random topology. Compared to REV, c2PB also performed slightly better with increased gene-to-gene signaling (Figure 6). None of the other parameters were significantly associated with the difference between c2PB and REV, indicating these two samplers are almost comparable in terms of the type of networks where they work well. Finally, two parameters, gene-to-gene signal and network topology, are associated with the difference between c3PB and c2PB, just as they are between c3PB and REV (Figure 6). These results demonstrate that a newly designed sampler, c3PB, is superior to REV, especially for highly correlated biological networks.

Figure 6

Pairwise comparison of top-performing samplers. Shown are marginal effect plots for the comparison between REV, c2PB, and c3PB with a single chain running. The dashed horizontal line represents a baseline representing equivalent performance.

Evaluation of the effect of genetic information

We designed our algorithm to incorporate SNPs as nodes in a Bayesian network, which means we estimated the structure of gene-to-gene networks and the eSNP mapping simultaneously. The incorporation of SNPs creates new sets of conditional independence relations, allowing us to distinguish between gene regulatory networks with equivalent likelihood. Furthermore, by including SNPs in the model we can potentially improve the model fit, leading to higher network recovery scores. We evaluated the benefit of incorporating SNP information for both the accuracy of the estimated network and the model fit to gene expression data, as follows. First, gene expression data and SNP data were simulated from networks composed of genes and SNPs. Then, gene expression data alone or both gene expression and SNPs were applied to the Bayesian network estimation programs. Our results show that AUCPR and average BIC score significantly improve when both gene expression data and SNP data are used for reverse engineering (Figure 7A). We also investigated parameters that affected the performance gain when SNPs were incorporated (Figure 7B). As the strength of SNP-to-gene signal ramped up, the benefit of incorporating SNPs also increased the network reconstruction accuracy. Conversely, as the number of genes, edge density, and noise level increased, the beneficial effect of SNPs became more limited. The effects on inferred network accuracy that stem from incorporating SNPs are independent of MCMC sampler, sample size, and network topology. Our simulation study indicates that SNP information is often helpful to estimate Bayesian networks, but might be less effective on data sets generated by real networks that are large, are noisy, or have dense sets of interactions.

Figure 7

Effect of genetic information on network accuracy. (A) Marginal AUCPR and average BIC plots for the use of genetic information. The average BIC is calculated as mean of BIC of sampled networks in each MCMC run (lower BIC scores are better). (B) Marginal effect plots for the comparison between the Bayesian network reconstruction based on gene expression and SNP information vs. gene expression only. The dashed horizontal line represents a baseline representing equivalent performance. ΔAUCPR is defined as the difference of all AUCPRs from simulations with genetics minus all AUCPRs from simulations without genetics.

Discussion

Reverse engineering of gene networks from systems genetics data (Jansen 2001) is an active research area. Although we focused on MCMC samplers for gene network structure learning in the present article, many other statistical approaches/frameworks have been proposed in the literature. For instance, Chaibub Neto applied the PC algorithm (Spirtes ) to first infer the skeleton of the gene network and then used expression QTL (eQTL) to determine the directions of the edges in the phenotype network. Liu proposed a two-step approach wherein an encompassing directed network (EDN) is first generated from assembled pairwise regulator–target relationships (which might be direct or indirect), followed by the application of structural equation models to search for a (possibly) cyclic network nested in the EDN, which best fits the data according to the BIC criteria. By restricting the network space to networks nested in the EDN, this approach is able to handle networks with hundreds of genes and eQTL. Logsdon and Mezey (2010) proposed a multiple-step approach where an association analysis is carried out to identify local eQTL, followed by the inference of an undirected network of gene expression traits and eQTL nodes via a covariance selection approach based on the adaptive lasso feature selection procedure and subsequent mapping of the undirected network into a possibly cyclic gene expression network. Zhang and Kim (2014) adopted sparse conditional Gaussian graphical models for modeling undirected gene networks under SNP perturbations, where the gene network structure learning task and the SNP feature selection task are performed jointly by solving a single optimization problem containing the lasso (for SNP feature selection) and the graphical lasso (for gene network structure learning) as special cases. A common feature of these non-Bayesian approaches is that the output of the reverse-engineering algorithm is a single point estimate of the gene network, and no statistical measure of uncertainty about the inferred network is available. The MCMC samplers, on the other hand, output an entire posterior distribution of network structures, from which Bayesian model-averaged estimates of the probability for the presence and direction of any particular edge in the network are readily available. This provides information about which parts of the inferred network were reconstructed with stronger confidence. This feature is particularly important for reverse engineering of gene networks since expression data are notoriously noisy. The main drawback is the increased computational requirements, compared to non-Bayesian/point estimate approaches, which are generally faster and scalable to larger networks. Nonetheless, MCMC approaches have been successfully applied to genome-scale real data (Zhu ; Zhang ) by first clustering the gene expression data into much smaller groups via weighted gene coexpression analysis (Zhang and Horvath 2005), followed by Bayesian networks reconstruction for the separate and more manageable gene clusters. MCMC samplers based on moves in the space of node orders (Friedman and Koller 2003; Eaton and Murphy 2007; Ellis and Wong 2008) have been shown to considerably improve the mixing of the Markov chain. Nonetheless, these samplers do not allow the explicit specification of prior distributions over network structures. Since our main applied interest is the reconstruction of Bayesian networks with noisy genomic data, where the incorporation of prior knowledge can improve reconstruction performance, we focus our attention on MCMC approaches based on moves in network structure space. To investigate network inference in the context of data sets that are likely encountered by experimental biologists, we conducted the largest simulation study comparing MCMC samplers for structure learning in Bayesian networks to date. By ranging over combinations of biologically plausible parameter settings, we can be confident in our conclusions about the relative performance of different inference methods. The variability in generative network structures also allows us to understand how network characteristics affect the performance of various network inference methods. Specifically, our simulation study is designed in the spirit of a multifactorial experiment with crossed factors, where simulation parameters play the role of factors. This allows us to investigate the effect of each parameter on the accuracy of Bayesian networks, in addition to comparing performance of MCMC samplers. The average edge density and the average gene-to-gene signal significantly affected the learning performance. As the average edge density increased, the learning performance of every method decreased monotonically. One possible reason for this behavior is that for Bayesian network inference, we restricted the number of parents to < = 3, to reduce the size of the network space to evaluate. On the other hand, we did not pose any restriction on the number of parents in simulated networks where the maximum edge density was set as 0.06, which means each node in networks with size 30, 65, and 100, is expected to have 1.8, 3.9, and 6 parents on average, respectively, assuming random network topology. Because of this limitation, it is possible that we miss some true incoming regulatory relationships for genes that are regulated by many other genes. However, in real applications, since screening the genes with many children is likely to be a predominant question of interest, this limitation may not be critical. Another reason for the generally harmful effect of edge density on performance could be the correlation strength of the data set. In our simulation framework, the average gene-to-gene correlation is influenced by edge density and the strength of gene-to-gene signaling. Generally, it is difficult to estimate true regulations in a highly correlated data set, since the Markov equivalence classes of a directed network cannot be identified uniquely (Uhler ). STR, the simplest sampler, showed the worst performance against networks with high edge density and high gene-to-gene signal, probably due to being trapped in suboptimal structures. This limitation has practical consequences for recent attempts to define causal networks within gene coexpression networks, since those coexpression networks are composed of correlated gene variables. However, our new sampler, c3PB, showed better performance compared with STR as well as REV, for the networks with high gene-to-gene signal, indicating c3PB works well for elucidating complex and correlated biological networks. Our multifactorial simulation was designed to cover many biologically plausible conditions, but there still exist a few common assumptions underlying the simulation and the inference that might bias our comparison results. For instance, during the Bayesian network estimation process, we restricted the number of parents per node. Since the amount of computation required for each MCMC iteration differs greatly among the samplers, this restriction may affect the performance of each sampler differently. Specifically, since REV and higher-order PBs samplers need to calculate all the parent set combinations in every MCMC iteration, the number of parents affects the number of iterations completed in a fixed time window. On the other hand, the STR sampler does not need to keep track of the parent sets. However, the inherent poor mixing of the STR sampler becomes an even more important issue when no restriction on the number of parents is imposed, as the sampler needs to search over a larger network space and a larger number of iterations is required for the convergence of the chain. Another assumption in our study is that 20% of the genes have eSNPs, and no more than two SNPs per gene are allowed. While these conditions are based on characteristics of real data (Stranger ), they still influence the extent of the SNP’s contribution to the gene regulatory system, and different samplers might respond differently to the distinct amounts of genetic influence. One factor that partially mitigates this limitation is that in addition to SNP frequency, other factors such as the strength of the SNP-to-gene associations also contribute to the total percentage of variance explained by SNPs in real systems. While we fix the number of genes with eSNPs, we still examine a range of scenarios for percentage of variance explained by SNPs, because we utilize three levels of SNP-to-gene coupling. Zhu provided a detailed simulation investigation of the effect of the integration of genetic and expression data in Bayesian network reconstruction performance. They focused on simulations from a single (biologically motivated) network structure, in contrast to our present study investigating many characteristics across 1458 distinct and biologically plausible networks. In Zhu , the authors concluded that the integration of SNP and expression data can greatly improve network reconstruction. In our simulations, we also observed that the genetic information has beneficial effects, but found that the effect size for genetics is smaller than the effect of other parameters, such as the particular method employed. We further clarified the benefit of genetics by showing that the advantage of including genetics was maximized when the strength of SNP-to-gene signal was high and the intrinsic noise was low. Signal strength and noise level actually determine the extent of SNPs’ contribution to gene expression levels. The middle range of signal strength and noise level corresponds to an average effect size of eSNPs commonly observed in real data. At this level of effect size, improper selection of an inference method and running setting of the MCMC chain potentially diminishes the benefit of utilizing SNP information. Alternatively, for systems where SNPs strongly influence expression, incorporating SNP data can produce a greater marginal increase in performance than the choice of MCMC method. For instance, incorporation of SNP data will likely be most beneficial for causal network inference of gene expression phenotypes clustered in strong hotspots. In our generative model, single SNPs regulated the expression level of single genes. This setting might underestimate the contribution of SNP to gene expression, since pleiotropic effects of SNPs have been noted (Wagner and Zhang 2011) and a single SNP might be able to regulate multiple genes directly through remodeling of chromatin structure and sharing functional DNA motifs. From the viewpoint of eSNP identification, the mapping of SNPs in the context of gene-to-gene networks potentially increases power and reduces false positives for detecting eSNPs (Chaibub Neto ). Latent variables such as influences from other genes or experimental conditions can mask true signals from SNPs. Stegle proposed the removal of such effects from the expression data prior to eSNP analysis. SNP-incorporated Bayesian networks jointly infer the effects of SNPs and latent effects from other genes and thus are promising tools for investigating genetic architecture. In this study, we extended the block Gibbs samplers proposed by Goudie and Mukherjee (2011) to, in theory, include an arbitrary number of parent sets. As a practical demonstration, we test one to four parent sets, to understand how higher-order blocking contributes to performance. While the performances of 2PB and 3PB are clearly superior to that of 1PB, the performance of 4PB decreased substantially (Figure 4A). Although the use of higher-order parent blocks improves the mixing of the Markov chain, it also exponentially increases the amount of computation necessary to keep track of the allowable combinations of parent sets, so that the number of steps completed by the sampler in a fixed time window is decreased. Hence, in spite of the better mixing of the chain, we still observe a performance drop. These results indicate that both drastic modifications for escape from suboptimal networks and computational efficiency are important in effective sampler operation. We designed “connected” block Gibbs methods, as a hybrid of the classic block Gibbs and REV methods, motivated by limitations of the older methods exposed by our simulation framework. At a given time step, both REV and 2PB update the parent sets of two nodes. However, the performance of 2PB was inferior to that of REV (Figure 4A). This result was unexpected because Gibbs samplers accept network modification at every step and thus are potentially more efficient than a Metropolis–Hastings sampler. A major difference between 2PB and REV that accounts for this is that 2PB updates parent sets of every pair of two nodes sequentially, whereas REV updates only parent sets of the two connected nodes. Therefore, we designed c2PB, which performs 2PB updates for only the two connected nodes at each step and also extended c2PB to ones with higher-order blocking, namely c3PB and c4PB. Figure 4A shows our connected PB samplers clearly improve the formal PB Gibbs samplers, resulting in comparable or better performance than that of REV. The performance improvements of cPB likely result from two sources. First, the speed of chain convergence is higher in cPBs compared with corresponding PBs (Figure S3). Second, cPBs can reach networks with lower BIC, meaning optimized network structures explain the data set better (Figure S3). These improvements are also evident when comparing c2PB with REV. cPBs can be seen as weighted versions of a PB sampler. Specifically, each gene is not updated with equal frequency, but rather the update probability for each gene is weighted based on the current network structure. Since cPBs weight all linearly connected gene pairs uniformly, a more complex weighting procedure may be useful to explore in the future. The relatively close performance of top-performing methods c2PB, c3PB, and REV (Figure 4) compared to other methods raises the possibility of a performance ceiling in Bayesian network reconstruction for biological data sets. While performance among these top methods differs substantially for certain classes of networks, does their collective performance indicate anything about the future of network inference? The issue at the heart of performance limitations on network inference is finding reasonable accurate network structures amid the enormous space of possible networks, which is superexponential with the number of nodes. For instance, networks from our “medium-size” simulations containing 65 nodes have a number of possible configurations that exceed the estimated number of atoms in the observable universe. Both algorithmic advances in network update steps or brute-force computational approaches could be used to improve network inference, with more efficient or more comprehensive approaches to exploring this huge search space. Our method comparison primarily focuses on algorithmic advances, although computational constraints also influenced the results, especially for inefficient methods. From a computational perspective, starting from many different random networks can help to avoid becoming stuck in local minima, as can performing more drastic update moves. Our results for single vs. multiple chains indicate that 1PB, c2PB, c3PB, and REV methods that converge rapidly will benefit the most from a massively parallel approach to finding optimal networks. But results from the classic STR method demonstrate that even an excess of computational resources is not sufficient to overcome a method that is easily trapped in local optima. At the same time, limitations on computational capabilities can hold back algorithms that should produce superior performance. For instance, we see increasing performance across 1PB, 2PB, and 3PB, yet performance of 4PB falls off, as computational overhead overcomes its theoretical advantages. Therefore, based on this simulation study, our general advice for future applications of the methods tested here is to devote substantial computational resources to the latest methods, such as c3PB, that are able to benefit from these resources. If a data set is known to have particular characteristics such as high levels of noise, etc., then the recommended method will shift to one that is known to function well under that regime, although we note that some methods such as c3PB function well under a majority of tested scenarios. One practical goal of network inference methods in early-stage drug discovery is to identify a small number of genes that control a specific molecular network or a disease signature. Perturbation experiments in model disease systems indicate that causal networks can indeed predict some disease-relevant genes and downstream targets (Chen ; Yang ). While large-scale causal network inference has been most frequently applied to data from mouse crosses whose common genetic background creates strong eQTL, these methods have also been applied to paired genetics and gene expression data sets from humans (Zhang ). However, these previous applications employed variants of the STR method, which is outperformed by all other methods tested here. While previous studies have made some correct predictions, there are also numerous false positive and false negative predictions, based on comparing the expected vs. actual (measured) downstream targets of perturbed genes in model systems (Zhang ). The greater accuracy of top-performing methods is therefore key to deriving accurate predictions from human expression data sets, with or without genetics. Because many gene expression studies do not include paired genetics data, correlation-based coexpression networks are frequently used to attempt to identify key disease genes (Gaiteri ). Specifically, hub genes within disease-correlated networks are often invoked as key disease genes. However, coexpression hubs are not necessarily causal in generating a phenotype of gene signature, because a gene at the top of a regulatory cascade may have only a few direct interactions. Such genes may be ignored when focusing on coexpression hub genes. Similarly, hub genes may be subject to incoming regulatory relationships, which cannot be detected in an undirected coexpression framework. Our simulations show that for typical human data sets, even in the absence of genetic priors, we can identify conditional dependence networks of genes that accurately reflect the true regulatory structure. This sparse hierarchical output could be utilized to aid in selecting genes that control a particular disease-relevant molecular system. Causal networks could even be useful in defining relationships across different diseases, to find directed interactions that mediate comorbidity. However, such data sets likely involve >100 genes, which entails a large network search space. The potential of causal inference to identify human regulatory networks among a large number of genes, with or without genetic priors, reinforces the need to use top-performing methods that do not become trapped in suboptimal/inaccurate network states.

34 in total

1. An integrative genomics approach to infer causal associations between gene expression and disease.

Authors: Eric E Schadt; John Lamb; Xia Yang; Jun Zhu; Steve Edwards; Debraj Guhathakurta; Solveig K Sieberts; Stephanie Monks; Marc Reitman; Chunsheng Zhang; Pek Yee Lum; Amy Leonardson; Rolf Thieringer; Joseph M Metzger; Liming Yang; John Castle; Haoyuan Zhu; Shera F Kash; Thomas A Drake; Alan Sachs; Aldons J Lusis
Journal: Nat Genet Date: 2005-06-19 Impact factor: 38.330

Review 2. The pleiotropic structure of the genotype-phenotype map: the evolvability of complex organisms.

Authors: Günter P Wagner; Jianzhi Zhang
Journal: Nat Rev Genet Date: 2011-03 Impact factor: 53.242

3. A Bayesian framework for inference of the genotype-phenotype map for segregating populations.

Authors: Rachael S Hageman; Magalie S Leduc; Ron Korstanje; Beverly Paigen; Gary A Churchill
Journal: Genetics Date: 2011-01-17 Impact factor: 4.562

4. Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations.

Authors: Benjamin A Logsdon; Jason Mezey
Journal: PLoS Comput Biol Date: 2010-12-02 Impact factor: 4.475

5. Patterns of cis regulatory variation in diverse human populations.

Authors: Barbara E Stranger; Stephen B Montgomery; Antigone S Dimas; Leopold Parts; Oliver Stegle; Catherine E Ingle; Magda Sekowska; George Davey Smith; David Evans; Maria Gutierrez-Arcelus; Alkes Price; Towfique Raj; James Nisbett; Alexandra C Nica; Claude Beazley; Richard Durbin; Panos Deloukas; Emmanouil T Dermitzakis
Journal: PLoS Genet Date: 2012-04-19 Impact factor: 5.917

6. Harnessing naturally randomized transcription to infer regulatory relationships among genes.

Authors: Lin S Chen; Frank Emmert-Streib; John D Storey
Journal: Genome Biol Date: 2007 Impact factor: 13.583

7. Genetic networks of liver metabolism revealed by integration of metabolic and transcriptional profiling.

Authors: Christine T Ferrara; Ping Wang; Elias Chaibub Neto; Robert D Stevens; James R Bain; Brett R Wenner; Olga R Ilkayeva; Mark P Keller; Daniel A Blasiole; Christina Kendziorski; Brian S Yandell; Christopher B Newgard; Alan D Attie
Journal: PLoS Genet Date: 2008-03-14 Impact factor: 5.917

8. Network-based analysis of affected biological processes in type 2 diabetes models.

Authors: Manway Liu; Arthur Liberzon; Sek Won Kong; Weil R Lai; Peter J Park; Isaac S Kohane; Simon Kasif
Journal: PLoS Genet Date: 2007-06 Impact factor: 5.917

9. Mapping cis- and trans-regulatory effects across multiple tissues in twins.

Authors: Elin Grundberg; Kerrin S Small; Åsa K Hedman; Alexandra C Nica; Alfonso Buil; Sarah Keildson; Jordana T Bell; Tsun-Po Yang; Eshwar Meduri; Amy Barrett; James Nisbett; Magdalena Sekowska; Alicja Wilk; So-Youn Shin; Daniel Glass; Mary Travers; Josine L Min; Sue Ring; Karen Ho; Gudmar Thorleifsson; Augustine Kong; Unnur Thorsteindottir; Chrysanthi Ainali; Antigone S Dimas; Neelam Hassanali; Catherine Ingle; David Knowles; Maria Krestyaninova; Christopher E Lowe; Paola Di Meglio; Stephen B Montgomery; Leopold Parts; Simon Potter; Gabriela Surdulescu; Loukia Tsaprouni; Sophia Tsoka; Veronique Bataille; Richard Durbin; Frank O Nestle; Stephen O'Rahilly; Nicole Soranzo; Cecilia M Lindgren; Krina T Zondervan; Kourosh R Ahmadi; Eric E Schadt; Kari Stefansson; George Davey Smith; Mark I McCarthy; Panos Deloukas; Emmanouil T Dermitzakis; Tim D Spector
Journal: Nat Genet Date: 2012-09-02 Impact factor: 38.330

10. Overlap of expression quantitative trait loci (eQTL) in human brain and blood.

Authors: Marna McKenzie; Anjali K Henders; Anthony Caracella; Naomi R Wray; Joseph E Powell
Journal: BMC Med Genomics Date: 2014-06-03 Impact factor: 3.063

9 in total

1. High-Dimensional Bayesian Network Inference From Systems Genetics Data Using Genetic Node Ordering.

Authors: Lingfei Wang; Pieter Audenaert; Tom Michoel
Journal: Front Genet Date: 2019-12-20 Impact factor: 4.599

2. New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data.

Authors: Grigoriy Gogoshin; Eric Boerwinkle; Andrei S Rodin
Journal: J Comput Biol Date: 2016-09-28 Impact factor: 1.479

Review 3. Genetic variants in Alzheimer disease - molecular and brain network approaches.

Authors: Chris Gaiteri; Sara Mostafavi; Christopher J Honey; Philip L De Jager; David A Bennett
Journal: Nat Rev Neurol Date: 2016-06-10 Impact factor: 42.937

4. Synthetic data generation with probabilistic Bayesian Networks.

Authors: Grigoriy Gogoshin; Sergio Branciamore; Andrei S Rodin
Journal: Math Biosci Eng Date: 2021-10-09 Impact factor: 2.080

5. Gene expression data analysis using Hellinger correlation in weighted gene co-expression networks (WGCNA).

Authors: Tianjiao Zhang; Garry Wong
Journal: Comput Struct Biotechnol J Date: 2022-07-13 Impact factor: 6.155

6. Multi-omic Directed Networks Describe Features of Gene Regulation in Aged Brains and Expand the Set of Genes Driving Cognitive Decline.

Authors: Shinya Tasaki; Chris Gaiteri; Sara Mostafavi; Lei Yu; Yanling Wang; Philip L De Jager; David A Bennett
Journal: Front Genet Date: 2018-08-09 Impact factor: 4.772

7. Identifying the molecular systems that influence cognitive resilience to Alzheimer's disease in genetically diverse mice.

Authors: Sarah E Heuer; Sarah M Neuner; Niran Hadad; Kristen M S O'Connell; Robert W Williams; Vivek M Philip; Chris Gaiteri; Catherine C Kaczorowski
Journal: Learn Mem Date: 2020-08-17 Impact factor: 2.460

8. Genetic risk for Alzheimer's dementia predicts motor deficits through multi-omic systems in older adults.

Authors: Shinya Tasaki; Chris Gaiteri; Vladislav A Petyuk; Katherine D Blizinsky; Philip L De Jager; Aron S Buchman; David A Bennett
Journal: Transl Psychiatry Date: 2019-10-03 Impact factor: 6.222

9. Causal Datasheet for Datasets: An Evaluation Guide for Real-World Data Analysis and Data Collection Design Using Bayesian Networks.

Authors: Bradley Butcher; Vincent S Huang; Christopher Robinson; Jeremy Reffin; Sema K Sgaier; Grace Charles; Novi Quadrianto
Journal: Front Artif Intell Date: 2021-04-14

9 in total