
A review of causal discovery methods for molecular network analysis.

Jack Kelly¹, Carlo Berzuini¹, Bernard Keavney²,³, Maciej Tomaszewski²,⁴, Hui Guo¹.

Abstract

BACKGROUND: With the increasing availability and size of multi‐omics datasets, investigating the causal relationships between molecular phenotypes has become an important aspect of exploring underlying biology and genetics. An increasing number of methodologies have been developed and applied to molecular networks to investigate these causal interactions.
METHODS: We have introduced and reviewed the available methods for building large-scale causal molecular networks that have been developed and applied in the past decade.
RESULTS: In this review we have identified and summarized the existing methods for inferring causality in large‐scale causal molecular networks, and discussed important factors that will need to be considered in future research in this area.
CONCLUSION: Existing methods for inferring causal molecular networks have their own strengths and limitations, so there is no one best approach, and the choice is instead down to the discretion of the researcher. This review also discusses some of the current limitations to biological interpretation of these networks, and important factors to consider for future studies on molecular networks.

Entities: Chemical

Keywords: Bayesian networks; causal inference; causal molecular network; mendelian randomisation; omics

Mesh：
Causality
Phenotype

Year: 2022 PMID: 36087049 PMCID: PMC9544222 DOI: 10.1002/mgg3.2055

Source DB: PubMed Journal: Mol Genet Genomic Med ISSN: 2324-9269 Impact factor: 2.473

INTRODUCTION

Molecular networks are important to understanding biological processes beyond the analysis of a single gene or molecule (Han, 2008). Molecular phenotypes at all levels do not operate in isolation; their interactions make up complicated networks that contain a wealth of information. In an age where more data is being produced than ever, these networks can become increasingly complex. A molecular network contains a set of nodes and edges. Nodes represent information from multi‐omics, including but not limited to genes, messenger RNAs (mRNAs), proteins, DNA methylation patterns and protein phosphorylation. Edges represent the relationships between the nodes and so can symbolise direct and indirect relationships between molecular phenotypes and transcriptional regulation. One of the primary advantages of molecular networks is in elucidating genetic and biological mechanisms underlying disease. Even in diseases with known causative genes (e.g. CFTR mutations causing cystic fibrosis (Elborn, 2016) and mutations in HTT leading to Huntington's disease (Ha & Fung, 2012)), these genes act as part of a large network and never in isolation. Dysregulated biological processes and important ‘hubs’ within them can be identified as disease drivers, which potentially helps identify drug targets that impact sets of associated genes rather than important individual genes, though this has yet to be translated to clinically useful therapies (Chagoyen et al., 2019).
Undirected networks have been an important approach for the investigation of biological processes and identification of hub genes in disease. Traditionally, protein–protein interaction networks have been built using a combination of in vivo and in vitro methods to understand interactions; however, these approaches have huge time and financial costs, and result in noisy networks with high false positive rates (Rao et al., 2014).
Approaches to omics data using in silico methods have been used as an alternative to better understand these undirected associations (Kotlyar et al., 2015). Most commonly, co‐expression molecular networks are built on the basis of correlation structures (Villa‐Vialaneix et al., 2013). It has become popular to use specific R software to infer undirected networks from transcriptomics data. For example, weighted gene co‐expression network analysis (WGCNA) (Langfelder & Horvath, 2008) is particularly user‐friendly as the authors have produced extensive tutorials and guides to increase accessibility to researchers. Although providing limited mechanistic understanding, undirected networks are important as they are often precursors of the study of causal networks. Many undirected networks (as shown in Figure 1a) rely on using correlation between nodes to infer symmetric associations. However, causal networks aim to differentiate the directed regulatory relationships from just associations. This approach identifies directed (as shown in Figure 1b) or mixed networks (as shown in Figure 1c). It is worth noting that directed relationships in a network do not necessarily have a causal interpretation, as they may merely depict temporal orders in the data generating process. Only if the confounders between the nodes have been adjusted for will these relationships have a causal meaning.
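As a minimal sketch of how a correlation‐based co‐expression network might be assembled, the Python example below links two genes whenever the absolute Pearson correlation of their expression reaches a cut‐off. The function name coexpression_edges, the hard threshold of 0.8 and the simulated expression values are illustrative assumptions; WGCNA itself uses a soft‐thresholding scheme that is not reproduced here.

```python
import numpy as np


def coexpression_edges(expr, gene_names, threshold=0.8):
    """Return undirected edges (gene_i, gene_j, r) with |Pearson r| >= threshold.

    expr has shape (n_samples, n_genes); the threshold is an arbitrary choice.
    """
    corr = np.corrcoef(expr, rowvar=False)
    n_genes = corr.shape[0]
    edges = []
    for i in range(n_genes):
        for j in range(i + 1, n_genes):  # upper triangle only: edges are symmetric
            if abs(corr[i, j]) >= threshold:
                edges.append((gene_names[i], gene_names[j], float(corr[i, j])))
    return edges


# Simulated data (hypothetical): G2 tracks G1 closely, G3 is unrelated.
rng = np.random.default_rng(0)
n_samples = 200
g1 = rng.normal(size=n_samples)
g2 = g1 + 0.1 * rng.normal(size=n_samples)
g3 = rng.normal(size=n_samples)
expr = np.column_stack([g1, g2, g3])

edges = coexpression_edges(expr, ["G1", "G2", "G3"], threshold=0.8)
for gene_a, gene_b, r in edges:
    print(f"{gene_a} - {gene_b}: r = {r:.2f}")
```

The resulting edges are symmetric associations only; nothing in this construction says which gene, if either, regulates the other.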

FIGURE 1

(a) an example of an undirected network, (b) a directed network and (c) a mixed network. Mixed networks have both directed and undirected edges.

Identifying causal relations from gene expression data was proposed over 20 years ago (Friedman et al., 2000). Since then, a large number of causal inference methods have been developed using omics data. This approach is advantageous in the study of biology as it allows for inferring causality without interventions, especially when randomised controlled trials are infeasible due to high cost and ethical issues (White & Vignes, 2019). As the technology becomes more accessible and affordable, an increasing range of omics data is being collected, which allows for integrative analysis to develop a more complete picture of how different types of omics interact with one another (Eales et al., 2021). Causal inference in molecular networks is a growing area of research. However, complex high dimensional causal networks have limited use and their contribution to the literature is heavily restricted as they are often difficult to interpret. Approaches are needed that allow for identification of biologically important sub‐networks and a small number of targets for future research or therapeutic intervention. In this review, we will discuss the current literature using causal discovery methods on molecular networks and the challenges that the area is facing. We will also discuss factors that influence interpretation of causal networks, including clustering and visualisation. Previous reviews (Glymour et al., 2019; Yazdani, 2020) have focussed on introducing methodologies for building causal networks and have given few biological examples; here, however, we focus on published methods and their applications specifically to molecular networks and subsequent biological interpretation.

CAUSAL MOLECULAR NETWORKS

Applications of different causal methods to omics data are covered in this review. The simplest causal network only involves the causal relationship between a pair of variables, investigating whether a single exposure can cause a single outcome. Causal networks can be made increasingly complex to investigate the relationships between thousands of variables. With applications to molecular phenotype data, the main approaches used to build causal networks have been Mendelian randomisation (MR) and Bayesian networks (BN), including the PC algorithm, as shown in Figure 2. Here, we consider MR and approaches to BNs, and we then focus on combinations of approaches that reduce the limitations of any single method. A summary of the methodologies discussed here is shown in Table 1.

FIGURE 2

(a) Schematic representation of MR. MR infers the causal effect of an exposure (phenotype) on the outcome using instrumental variables (IVs). (b) Causal Bayesian networks connect nodes via directed edges determined by conditional independence, which is present when the relationship between two nodes is independent conditioning on all other nodes in the graph. (c) Schematic representation of the PC algorithm. The true causal graph is shown in (b). The PC algorithm initially begins with an undirected fully connected graph (i) and uses data to create a skeleton graph with undirected edges. In this case, the X1 − X2 edge is removed because X1 is independent of X2 (ii) and the X1 − X4 edge is removed as the nodes are independent given X3. The same is true for the X2 − X4 edge (iii). Then v‐structures are identified (iv) and final edges oriented (v) (Le et al., 2019).

TABLE 1

Summary of the discovery methods for analysis of causal molecular networks including the software available

Methodologies	Data source required	Advantages	Disadvantages	Software available
Mendelian randomisation
Mendelian randomisation (MR)	GWAS, omics	Only requires summary statistics, fast to run; estimates causal effect size	Data must meet certain (possibly untestable) assumptions; incapable of modelling complex relationships	MendelianRandomization (R package) (Yavorska & Burgess, 2017); TwoSampleMR (R package) (Hemani et al., 2018); MR‐Base (Hemani et al., 2018)
Bayesian MR	GWAS, omics	Flexibility of modelling complex data structure (overlapping samples, horizontal pleiotropy, interactions, multiple exposures); estimates causal effect size	Data must meet certain (possibly untestable) assumptions; computationally intensive, applicable to small‐to‐medium causal networks	
Bayesian networks
Bayesian networks (BN)	omics	Can generate larger causal networks; causal edge probabilities are given; estimates causal effect size	Computationally intensive, limiting network size	bnlearn (R and Python package) (Scutari, 2010); BayesNetty (Howey et al., 2020, 2021)
PC algorithm (Spirtes et al., 2000)	omics	Relatively fast compared to other BNs	Although faster than alternatives, computationally challenging when run on very large datasets; causal effect size is not inferred	bnlearn (R and Python package) (Scutari, 2010); pcalg (R package) (Hauser & Bühlmann, 2012; Kalisch et al., 2012)
Combination methods
Bayesian and MR (findr (Wang et al., 2019)/MRPC (Badsha & Fu, 2019))	GWAS, omics	Undirected network construction followed by edge directions inferred using MR	Still computationally intensive and applications have been on subsets of omics data; causal effect size is not inferred	findr (R package) (Wang et al., 2019); MRPC (R package) (Badsha & Fu, 2019)
Genome Granularity DAG (GDAG) (Yazdani et al., 2016a)	GWAS, omics	Undirected network construction followed by edge directions inferred using MR	Still computationally intensive and applications have been on subsets of omics data; causal effect size is not inferred	
Causal Graphical Analysis Using GEnetics (cGAUGE) (Amar et al., 2021)	GWAS, omics	Approach has greater power and lower false discovery rate than BNs	Computationally intensive; causal effect size is not inferred; greater power than BN, but reduced power in presence of horizontal pleiotropy	cGAUGE (R code: https://github.com/david‐dd‐amar/cGAUGE) (Amar et al., 2021)
Time series causal networks
Granger causality (Granger, 1969)	Time series omics	Allows causal inference using time series omics data	Time intervals between measurements need to be long enough for a noticeable change to take place; there must be no confounders	lmtest (R package) (Zeileis & Hothorn, 2002); statsmodels.tsa.stattools.grangercausalitytests (Python package) (Seabold & Perktold, 2010)
Optimal Causation Entropy (OCE) (Sun et al., 2015)/PCMCI (Runge et al., 2019)	Time series omics	Outperform Granger causality using time series omics data; can generate large‐scale causal networks	Assumes stationarity, which can be violated by confounders	TIGRAMITE (Python package) (Runge et al., 2019)

Mendelian randomisation

MR uses single‐nucleotide polymorphisms (SNPs) as ‘instrumental variables’ (IVs) to infer the causal effect of an exposure on an outcome. It mimics randomised controlled trials by assuming that SNP genotypes are randomly assigned to individuals within a population. MR requires three key assumptions (Figure 2a): (a) IVs are associated with the exposure of interest; (b) IVs are independent of confounders (both observed and unobserved) between exposure and outcome; (c) IVs only affect the outcome through the exposure of interest. Horizontal pleiotropy occurs when the IV influences the outcome outside of its effect on the exposure, breaking assumption (c). Several adaptations of MR have been developed to reduce the impact of horizontal pleiotropy. Popular approaches include MR‐Egger (Bowden et al., 2015) (which models pleiotropy assuming that the pleiotropic effects of the IVs are independent of their effects on the exposure), MR‐PRESSO (Verbanck et al., 2018) (which corrects for pleiotropic outlier effects) and Causal Analysis Using Summary Effect estimates (CAUSE) (Morrison et al., 2020) (which accounts for correlated and uncorrelated pleiotropic effects). MR‐PRESSO and MR‐Egger are often both applied to the same data and their results compared to reduce the impact of pleiotropy and outliers. These approaches have been used to provide evidence to support the causal effect of estimated glomerular filtration rate, a measure of kidney function, on chronic kidney disease, kidney stone formation, diastolic blood pressure and hypertension (Morris et al., 2019). Additionally, they have been used to show the causal effect of blood pressure on renal outcomes commonly affecting patients with hypertension (Eales et al., 2021). In most cases discussed above, MR analysis requires that the IV‐exposure and IV‐outcome associations come from two independent studies (Lawlor, 2016). This is known as two‐sample MR.
There are a limited number of one‐sample MR methods that deal with IVs, exposures and outcomes coming from a single study (Bowden et al., 2015; Zhao et al., 2018). Some expansions to MR have been developed to handle data when two studies have overlapping individuals in common (LeBlanc et al., 2018), which in classic MR approaches leads to bias. Zou et al. (2020) have developed a more flexible Bayesian MR method that can handle one, two and overlapping samples. Bayesian MR has an advantage in its flexibility in coping with complex data structures, such as overlapping samples, horizontal pleiotropy, study heterogeneity and multiple exposures and outcomes, all in a single model (Berzuini et al., 2020; Zou et al., 2020, 2021). Advanced MR methods have been developed more recently, such as MR‐ConMix (contamination mixture method for robust and efficient estimation) (Burgess et al., 2020) and GRAPPLE (Genome‐wide mR Analysis under Pervasive PLEiotropy) (Wang et al., 2021), which utilises both strongly and weakly associated SNPs to identify multiple pleiotropic pathways. The authors of both methods have discussed the future importance of including multiple exposures in the study of genetics and MR. The Causal Inference Test (cit) (Millstein et al., 2016) is a more conservative method that applies the principles of MR and is more robust to pleiotropic effects and reverse causation. These advancements in MR methodologies provide researchers with more options to design models that better fit the assumptions of MR. MR has been increasingly applied to infer causality (Bowden & Holmes, 2019; Nordestgaard & Nordestgaard, 2016); however, applications have focused on small‐to‐medium scale networks, and applications to large‐scale omics networks have been limited. A thorough review of MR has recently been published by Sanderson et al. (2022). Nevertheless, MR has been used in combination with other approaches to build molecular networks, which will be discussed shortly.
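To make the two‐sample setting concrete, the sketch below computes an inverse‐variance weighted (IVW) estimate and an MR‐Egger regression from per‐SNP summary statistics using plain NumPy. It is an illustration under stated assumptions rather than the implementation found in any of the packages in Table 1: the function names ivw_estimate and mr_egger, the simulated effect sizes and the constant pleiotropic shift of 0.05 are all hypothetical.

```python
import numpy as np


def ivw_estimate(beta_x, beta_y, se_y):
    """IVW causal estimate: weighted regression of beta_y on beta_x through the origin."""
    w = 1.0 / se_y**2
    return float(np.sum(w * beta_x * beta_y) / np.sum(w * beta_x**2))


def mr_egger(beta_x, beta_y, se_y):
    """MR-Egger: the same weighted regression, but with an intercept.

    Returns (intercept, slope). A non-zero intercept suggests directional
    pleiotropy; the slope is the pleiotropy-adjusted causal estimate.
    Assumes every SNP has a non-zero association with the exposure.
    """
    sign = np.sign(beta_x)  # orient each SNP so its exposure association is positive
    bx, by = beta_x * sign, beta_y * sign
    sqrt_w = 1.0 / se_y
    design = np.column_stack([np.ones_like(bx), bx]) * sqrt_w[:, None]
    coef, *_ = np.linalg.lstsq(design, by * sqrt_w, rcond=None)
    return float(coef[0]), float(coef[1])


# Simulated summary statistics (hypothetical values).
rng = np.random.default_rng(1)
n_snps, true_effect, se = 50, 0.5, 0.005
beta_x = rng.uniform(0.1, 0.3, size=n_snps)       # SNP-exposure associations
se_y = np.full(n_snps, se)                        # SEs of SNP-outcome associations
beta_y_clean = true_effect * beta_x + rng.normal(scale=se, size=n_snps)
beta_y_pleio = beta_y_clean + 0.05                # constant directional pleiotropy

ivw_clean = ivw_estimate(beta_x, beta_y_clean, se_y)
ivw_pleio = ivw_estimate(beta_x, beta_y_pleio, se_y)
egger_intercept, egger_slope = mr_egger(beta_x, beta_y_pleio, se_y)

print(f"IVW, no pleiotropy:   {ivw_clean:.3f}")
print(f"IVW, with pleiotropy: {ivw_pleio:.3f}")
print(f"MR-Egger intercept:   {egger_intercept:.3f}, slope: {egger_slope:.3f}")
```

In this simulation the pleiotropic shift pushes the IVW estimate away from the true effect of 0.5, while the MR‐Egger intercept absorbs the shift. Real analyses would also report standard errors and sensitivity checks, which are omitted here.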

Bayesian networks

Bayesian networks (BNs) were one of the first approaches proposed to investigate gene expression networks (Friedman et al., 2000). BNs use Bayesian inference to calculate probabilistic graphical models of data. BNs are directed acyclic graphs (DAGs), with directed edges and no subset of nodes that can form a closed loop. The edges of the DAG are determined via conditional independence, which is present when two nodes are independent conditioning on all other nodes in the graph. An example of a BN is shown in Figure 2b. The two traditional classes of Bayesian network learning methods are constraint‐based and score‐based. Constraint‐based methods learn an undirected network skeleton using conditional independence testing and then assign the direction of edges between nodes that are not found to be independent. Score‐based methods instead aim to optimise a scoring criterion across a search space of DAGs. Additionally, there are hybrid algorithms that combine constraint‐based and score‐based algorithms; although these have been widely applied in building causal networks (Li & Guo, 2018), they have had limited applications in the molecular network literature. Due to the high computational cost, most studies have been limited to inferring causal relationships within triplets of a gene regulatory network (Bucur et al., 2019), with limited approaches to scaling up to larger, more complete molecular networks. Much of the literature using BNs to infer molecular networks has introduced limitations to the size of the networks built. Mäkinen et al. (2014) used BNs to investigate coronary artery disease, introducing genetic information as priors by not allowing genes that have no associated SNPs to be parents of genes that have an associated SNP. However, this was only done on a subset of genes rather than a full network. Azad and Alyami (2021) used BNs to investigate causal gene expression networks in lapatinib resistance to better understand why some breast cancer patients have unsuccessful treatment.
They used different Markov Chain Monte Carlo (MCMC) sampling algorithms to identify the optimum molecular network from the BN search space. MCMC samples a probability distribution where the next sample is dependent on the current sample. The study was limited to genes within the TGF‐β signalling pathway in lapatinib‐sensitive and lapatinib‐resistant breast cancer cells, identifying the driver genes as being associated with the GO biological terms positive regulation of pathway‐restricted SMAD protein phosphorylation and regulation of lymphocyte. Other approaches to learning BNs using MCMC schemes have been proposed. Castelletti and Consonni (2019) used MCMC to learn the Markov equivalence class of DAGs to investigate protein signalling in observational and interventional samples. This approach requires little tuning as it uses default parameter priors and so is more accessible to researchers than other Bayesian approaches. The authors have also used a Bayesian active learning procedure to identify DAGs (Castelletti & Consonni, 2020) in the same protein signalling dataset and show that DAGs can be identified even when only using a subset of the intervention samples. Similarly, Bhattacharya and Das (2019) applied BNs to investigate causal genes in drug pathways for cancer, using a limited set of known drug target genes and genes identified by machine learning. Using a small dataset, they identified gene‐to‐gene connections that play a role in imatinib resistance in chronic myeloid leukaemia, including an ACADVL to PDIA5 connection present uniquely in non‐responder populations. These two proteins have been previously shown to play important roles in cancer drug resistance (Higa et al., 2014). Additionally, BNs have been used in the past to identify any causal effects of microRNA (miRNA) on gene expression interactions (Lee & Jiang, 2017).
However, these networks are very limited, with causal edges only from miRNA to gene expression, and in many cases they failed to identify known gene–gene interactions from experiment‐supported databases. Identifying the optimal BN is very difficult, and many approaches have been proposed with the aim of improving this process within transcriptional networks (Azad & Alyami, 2021). For example, Howey et al. have developed BayesNetty (Howey et al., 2020, 2021), accessible software for building Bayesian networks using genetic and phenotypic data. This software allows users to apply algorithms available in the R package bnlearn (Scutari, 2010) to biological relationships. Howey et al. (2020) used BayesNetty to implement the score‐based BN approach called hill climbing to investigate a small number of interactions between metabolites and phenotypic data. They used genetic anchors to ensure there can be no directional edges towards genetic variants and found that the approach outperformed MR in highly pleiotropic scenarios. This software includes approaches that can effectively impute missing data using a version of nearest neighbour imputation and the ability to add weights to certain edges, allowing researchers to incorporate prior knowledge concerning directions between nodes. These improvements have only been shown to be moderate, and the methods remain computationally intensive for generating large networks. Large amounts of information could be missed if only a subset of data is used to build causal networks, which is generally the approach used with BNs due to the high computational cost. It is possible to sacrifice accuracy of networks for speed using approximate solutions (Guo & Constantinou, 2020); however, this is not guaranteed to make it possible to build networks using data as highly dimensional as omics data.
The PC algorithm (Spirtes et al., 2000) (named after its initial authors, Peter Spirtes and Clark Glymour) is a constraint‐based approach to estimating Bayesian networks, starting with a fully connected undirected graph and recursively deleting edges based on conditional independence properties. This generates a completed partially directed acyclic graph (CPDAG), which consists of both directed and undirected edges. The steps the PC algorithm takes to build causal networks are shown in Figure 2c. The PC algorithm is fast for high dimensional and sparse problems, which makes it more suited to use with molecular network data (Maathuis et al., 2010). Zhang et al. (2012) used the PC algorithm with gene expression data to identify conditional independence between pairs of genes to build gene regulatory networks. Le et al. (2013) predicted the causal mRNA targets of miRNAs using a method named Intervention‐calculus when the DAG is Absent (IDA) (Maathuis et al., 2010). IDA has been shown to be useful in investigating the impact of regulators on gene expression (Ye et al., 2021) but has seen little practical use in investigating disease. Zhang et al. (2014) applied the method to epithelial‐mesenchymal transition and multi‐class cancer datasets, and results were validated by transfection experiments. Zhang et al. (2014) used the IDA approach to infer miRNA‐mRNA pair interactions, and identified differences in causal effects between different conditions. They have used IDA to infer causality of long non‐coding RNA (lncRNA) on mRNA within modules identified using WGCNA to identify lncRNAs in specific biological functions (Zhang et al., 2018), an approach that has also since been used to investigate pan‐cancer (Ye et al., 2021). Despite being faster than alternatives, the PC algorithm is still slow when applied to high dimensional datasets, and so runtime will increase as more data are integrated (Le et al., 2019).
The PC algorithm has seen limited use on its own in applications to molecular networks. However, it has been used more recently in combination with other approaches to infer causality in biological data.
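The skeleton and v‐structure steps of the PC algorithm described above can be sketched in a few lines of Python. This is a simplified illustration on simulated Gaussian data with the same structure as Figure 2 (X1 → X3 ← X2, X3 → X4), using a Fisher z test of partial correlation as the conditional independence test. The function names, the significance level and the data are assumptions made for the example; packages such as pcalg provide complete, order‐independent implementations, including the final orientation rules that are omitted here.

```python
import math
from itertools import combinations

import numpy as np


def fisher_z_pvalue(corr, i, j, cond, n):
    """Two-sided p-value for zero partial correlation of i and j given cond."""
    idx = [i, j] + list(cond)
    prec = np.linalg.inv(corr[np.ix_(idx, idx)])
    r = -prec[0, 1] / math.sqrt(prec[0, 0] * prec[1, 1])
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - len(cond) - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))


def pc_skeleton(data, alpha=0.01):
    """Start fully connected; delete an edge once some conditioning set separates it."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    adj = {i: set(range(p)) - {i} for i in range(p)}
    sepset = {}
    level = 0  # size of the conditioning sets tried in this pass
    while any(len(adj[i]) - 1 >= level for i in range(p)):
        for i in range(p):
            for j in list(adj[i]):
                for cond in combinations(sorted(adj[i] - {j}), level):
                    if fisher_z_pvalue(corr, i, j, cond, n) > alpha:
                        adj[i].discard(j)
                        adj[j].discard(i)
                        sepset[frozenset((i, j))] = set(cond)
                        break
        level += 1
    edges = {tuple(sorted((i, j))) for i in adj for j in adj[i]}
    return edges, sepset


def orient_v_structures(edges, sepset, p):
    """Orient i -> k <- j when i, j are non-adjacent and k is not in their separating set."""
    adj = {i: set() for i in range(p)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    directed = set()
    for k in range(p):
        for i, j in combinations(sorted(adj[k]), 2):
            if j not in adj[i] and k not in sepset.get(frozenset((i, j)), set()):
                directed.update({(i, k), (j, k)})
    return directed


# Simulated data (hypothetical); columns 0..3 stand for X1..X4.
rng = np.random.default_rng(2)
n = 2000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + rng.normal(size=n)
x4 = x3 + rng.normal(size=n)
data = np.column_stack([x1, x2, x3, x4])

edges, sepset = pc_skeleton(data, alpha=0.001)
directed = orient_v_structures(edges, sepset, p=4)
print("skeleton:", sorted(edges))
print("v-structure edges:", sorted(directed))
```

The number of conditional independence tests grows quickly with the number of nodes and the size of their neighbourhoods, which is one way to see why runtime becomes a concern on dense, high dimensional omics data.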

Combination of approaches

Research is trending towards the use of a combination of approaches to building causal molecular networks, with the aim of reducing the limitations of individual approaches and building more robust networks. MR, in particular, has been combined with other methods to help determine network topologies and speed up construction of causal networks by putting constraints on edge directions. Yazdani et al. (2016a) proposed an approach to identifying causal networks named genome granularity DAG (GDAG). Initially, strong IVs are generated from phenotype SNP data across each chromosome independently. The structure of the undirected network for omics data is identified, and the principle of MR is used to determine the directionality of edges using the strong IVs generated previously. They have used this approach to investigate the network of metabolites (Yazdani et al., 2016b, 2019). Augmenting Bayesian networks with the principles of MR has become popular for building molecular networks (Yazdani, 2020). Wang et al. (2019) have tried to address the computational limitations of BNs on large‐scale transcriptome‐wide networks using a tool they have named findr. They used the SNPs that are directly associated with gene expression, known as expression quantitative trait loci (eQTLs). For each gene, the most strongly associated eQTL is selected as the IV for inferring the pairwise causal relationships between all genes in the network. These edges are ranked and assembled into a DAG (Wang & Michoel, 2018). This method is much more efficient and outperforms traditional ways of building BNs, though it has rarely been practically applied in the literature. Badsha and Fu (2019) have developed MRPC, which incorporates the principle of MR into the PC algorithm. The principle of MR is generalised to account for a variety of causal relationships between SNPs and molecular phenotypes.
MRPC begins by learning the graph skeleton using the PC algorithm with an online false discovery rate correction, and any edges between SNPs and molecular phenotypes are oriented to point from SNPs to molecular phenotypes. MRPC then looks for v‐structures in the network between any three nodes and uses the principle of MR to help orient edges. Although MRPC has been shown to be very effective for building molecular networks, there is still room to develop further. Within small to medium networks MRPC performs exceptionally well; however, for very high dimensional data, as is common with multi‐omics data, it is still computationally expensive and could be further optimised. A recent paper by Zuber et al. (2020) proposed a multivariable MR and Bayesian model averaging (MR‐BMA) approach that can include information from many IVs using only summary statistics from genetic association studies. It assumes the proportion of true causal risk factors is sparse when compared with all risk factors, which they demonstrate is usually the case with metabolomics data. Using MR‐BMA, they identify high density lipoprotein (HDL) cholesterol as a potential causal risk factor for age‐related macular degeneration, supported by previous literature (Burgess & Davey Smith, 2017). This approach has also been used to identify Apolipoprotein B as a key lipid risk factor for coronary artery disease (Zuber et al., 2021). All the above methods using the principle of MR require that the three assumptions of MR are satisfied. As multi‐omics data is large and complex, using MR to sidestep the problems of confounding and reverse causation is important for causal network inference. Causal Graphical Analysis Using GEnetics (cGAUGE) has also been proposed to construct causal networks by Amar et al. (2021). cGAUGE first identifies conditional independencies in the data that are used to identify IVs for downstream MR, and for the construction of large‐scale networks, a procedure called ExSep. Initially, the skeleton is found using the PC algorithm.
Edges between nodes are then oriented. If SNPs are marginally associated with a node X2, but are independent of X2 given another node X1, then this is used as evidence that X1 causally affects X2. cGAUGE does not infer causal effect size, so there is a lot of future potential in integrating ExSep with MR and other approaches to infer the skeleton and quantify causal effects.
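The orientation rule just described (a SNP associated with X2 marginally, but independent of X2 once X1 is conditioned on) can be illustrated with a small simulation. The sketch below is not the cGAUGE code: the function names, the partial correlation test and the two significance thresholds are assumptions chosen only to show the logic on data simulated as SNP → X1 → X2.

```python
import math

import numpy as np


def partial_corr_pvalue(x, y, z=None):
    """Fisher z p-value for corr(x, y), optionally after regressing both on z."""
    n = len(x)
    n_cond = 0
    if z is not None:
        design = np.column_stack([np.ones(n), z])
        x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
        y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
        n_cond = 1
    r = np.corrcoef(x, y)[0, 1]
    fisher_z = 0.5 * math.log((1 + r) / (1 - r))
    return math.erfc(math.sqrt(n - n_cond - 3) * abs(fisher_z) / math.sqrt(2))


def separation_evidence(snp, cause, effect, alpha_assoc=1e-3, alpha_indep=0.01):
    """True when the SNP is associated with `effect` marginally but not given `cause`.

    The thresholds are illustrative; failing to reject independence is weak
    evidence and depends on sample size.
    """
    marginal_p = partial_corr_pvalue(snp, effect)
    conditional_p = partial_corr_pvalue(snp, effect, cause)
    return marginal_p < alpha_assoc and conditional_p > alpha_indep


# Simulated data (hypothetical): genotype -> X1 -> X2.
rng = np.random.default_rng(4)
n = 3000
snp = rng.binomial(2, 0.3, size=n).astype(float)
x1 = 0.5 * snp + rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)

forward = separation_evidence(snp, cause=x1, effect=x2)  # tests X1 -> X2
reverse = separation_evidence(snp, cause=x2, effect=x1)  # tests X2 -> X1
print("evidence for X1 -> X2:", forward)
print("evidence for X2 -> X1:", reverse)
```

In the reverse direction the SNP remains associated with X1 after conditioning on X2, so the rule gives no support for X2 → X1. As the text notes, this kind of rule indicates direction only and does not quantify the size of the effect.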

Time series data

Time series data provides the opportunity to investigate molecular networks across a biological process. Generating causal networks is made much more difficult by the problems that inherently come with this data type. In particular, the time between measurements may be inconsistent or may not reflect the rate of change that is being investigated, causal relations can change greatly over time, and unmeasured confounding variables may be introduced. As multi‐omics data becomes easier to generate, there has been an increased interest in using time‐series data to investigate molecular networks (Barman & Kwon, 2018). The most common approach to identifying causality in time series molecular data is Granger causality, under which a variable X is said to Granger‐cause Y if past values of X provide statistically significant information about the future values of Y (Granger, 1969). Heerah et al. (2021) have proposed Granger‐causal analysis of gene expression data that can handle irregularly‐spaced bivariate signals. However, Granger causality has some limitations that become obvious when using multi‐omics data. The time intervals between measurements need to be long enough for a noticeable change to take place, and there must be no confounders. Both assumptions are rarely met with biological data. Stehr et al. (2019) have used Siamese neural networks for causal inference in time series data, which gives the approximate probabilities between nodes. However, this approach has only been performed on balanced synthetic data and has yet to be shown to be effective in real unbalanced data. Multiple Bayesian approaches to inferring causality in time series gene expression data have been developed. fastBMA implements Bayesian model averaging (BMA) to efficiently identify gene regulation networks (Hung et al., 2017).
Other Bayesian approaches such as Bayesian Gene Regulation Model Inference (BGRMI) (Iglesias‐Martinez et al., 2016) can integrate known protein interactions and ChIP‐sequencing data as prior knowledge to assist in reconstructing regulatory networks from time series gene expression data. Causal analysis of time series molecular data is still very limited. Although new methodologies are being developed in other research domains (Runge, 2018), there have been few applications to molecular networks. Modern algorithms such as Optimal Causation Entropy (OCE) (Sun et al., 2015) and PCMCI (Runge et al., 2019), which combines the PC algorithm with a momentary conditional independence (MCI) test to account for autocorrelation and control false positive rates, have been shown to outperform Granger causality and to handle large‐scale networks. Applying these approaches to molecular networks would be an important step in progressing the analysis of time series causal molecular networks.
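As a concrete illustration of the Granger idea discussed in this section, the following is a minimal sketch of a bivariate, regularly sampled Granger test. It compares a regression of Y on its own past with one that also includes the past of X, and reports the resulting F statistic. The function names, the lag of one, and the simulated coefficients are assumptions made for illustration. The sketch does not address irregular sampling or confounding, and it stops at the F statistic because the Python standard library has no F distribution to convert it to a p-value.

```python
import random


def ols_rss(rows, y):
    """Residual sum of squares from a least-squares fit of y on rows."""
    k = len(rows[0])
    # Augmented normal equations: (X'X | X'y).
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                f = a[row][col] / a[col][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, y))


def granger_f(x, y, lag=1):
    """F statistic for the hypothesis that x Granger-causes y."""
    targets = y[lag:]
    restricted, full = [], []
    for t in range(lag, len(y)):
        own = [y[t - k] for k in range(1, lag + 1)]
        other = [x[t - k] for k in range(1, lag + 1)]
        restricted.append([1.0] + own)        # y's own past only
        full.append([1.0] + own + other)      # plus the past of x
    rss_r = ols_rss(restricted, targets)
    rss_f = ols_rss(full, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_f) / lag) / (rss_f / dof)


# Toy series in which x drives y with a one-step delay (assumed coefficients).
rng = random.Random(1)
x, y = [0.0], [0.0]
for _ in range(499):
    x_prev, y_prev = x[-1], y[-1]
    x.append(0.6 * x_prev + rng.gauss(0, 1))
    y.append(0.5 * y_prev + 0.8 * x_prev + rng.gauss(0, 1))

f_xy = granger_f(x, y)  # should be large: x helps predict y
f_yx = granger_f(y, x)  # should be small: y adds little about x
print(f"F(x -> y) = {f_xy:.1f}, F(y -> x) = {f_yx:.1f}")
```

A large F statistic in one direction only is the pattern Granger causality looks for. In practice, a maintained statistical package would usually be preferred over a hand-rolled solver.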

BIOLOGICAL INTERPRETATION OF NETWORKS

Networks of connected genes can quickly become very complex, which severely limits biological interpretation, even in simple co‐expression networks (Serin et al., 2016). Even when interpreting simple networks, it is important to distinguish between association and causality. Inappropriate use of causal language has been a particular problem in the biological sciences in the past (Boutron & Ravaud, 2018). Causal molecular networks are often high dimensional. Many studies (Azad & Alyami, 2021; Bhattacharya & Das, 2019) have identified smaller subsets of genes of interest, through previous knowledge of pathways or clustering of undirected networks, before inferring causality. However, this can miss factors that are relevant within the causal network but lie outside the cluster or are not identified by traditional univariate analysis. Alternatively, constructing a causal network first and then clustering the nodes would identify functionally close sets of variables that are likely to be involved in similar biological processes. Few published papers have carried out clustering within causal molecular networks. As the size of these networks grows, clustering will become increasingly important for identifying biological processes and the important causal molecules within them.
One application of large causal molecular networks is drug discovery and repurposing. Previous approaches to identifying drugs have focussed on correlating transcription signatures between disease and known drugs (Belyaeva et al., 2021). However, this approach generates drugs and therapeutic targets that are rarely researched further, and it has had little success in bringing new treatments to the clinic. Causal pathways allow for more in‐depth identification of drug targets. Škrlj et al. (2021) have developed Causal Network of Diseases (CaNDis), which uses causal protein–protein interactions to identify FDA‐approved drugs that can impact particular diseases.
A known drug pathway signature from databases such as CMap (Lamb et al., 2006) can be matched to the causal network to impact a particular target. Causal networks can also be studied to identify upstream regulators of known disease targets that can themselves be targeted with drugs. Unfortunately, these advances have seen little use in the literature and thus limited translation to the clinic. As these drug discovery tools become more accessible, future research that constructs molecular causal networks should make use of them and develop the methodologies further.
Network visualisation is often one of the first steps once networks have been created. One of its advantages is the ability to communicate results to readers and colleagues who do not have a full understanding of how the results were generated. Appropriate visualisation is therefore crucial to reflect the results and get the most from the data. There are many tools that assist in visualising networks, including Cytoscape (Shannon et al., 2003) and Gephi (Bastian et al., 2009). These tools generally offer a large amount of customisability, particularly in automatically generating layouts. However, visualising and interpreting very large and complex networks can be difficult and is often overlooked in the literature. The most appropriate way of displaying a network depends heavily on the type of network being visualised, and so requires a large amount of input from someone who understands the data and how they have been analysed. In molecular networks with multi‐omics data, layering the different omics types within the visualisation to show how they interact would give a much more structured view than any of the predesigned layouts available.
Some approaches, including Bayesian networks and MR, provide causal effect sizes, which can be visualised within networks by increasing edge width for larger effect sizes. This allows experts from other biological fields to interpret the interactions of molecular phenotypes and is more likely to lead to future research. There is potential for creating interactive networks in which nodes and edges are included or excluded by adjusting a causal effect size threshold. One of the aims of causal inference is the identification of a small number of targets for therapeutic intervention, so effective and easily interpreted visualisation can help other researchers to identify the networks of particular interest to them.
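The thresholding and edge‐scaling ideas above can be sketched in a few lines. The example below assumes a network stored as (source, target, effect size) tuples. The node names and effect sizes are invented purely for illustration, and the width range is an arbitrary choice rather than a recommendation from any visualisation tool.

```python
def filter_edges(edges, threshold):
    """Keep edges whose absolute effect size meets the threshold.

    Returns the remaining nodes (sorted) and the retained edges.
    """
    kept = [(s, t, e) for s, t, e in edges if abs(e) >= threshold]
    nodes = sorted({n for s, t, _ in kept for n in (s, t)})
    return nodes, kept


def edge_width(effect, max_abs_effect, min_width=0.5, max_width=6.0):
    """Map an absolute effect size linearly onto a drawing width."""
    scale = abs(effect) / max_abs_effect
    return min_width + (max_width - min_width) * scale


# Hypothetical multi-omics network; names and values are illustrative only.
edges = [
    ("geneA", "proteinA", 0.80),
    ("geneA", "geneB", -0.45),
    ("geneB", "metaboliteC", 0.10),
    ("proteinA", "metaboliteC", 0.30),
]

largest = max(abs(e) for _, _, e in edges)
for threshold in (0.3, 0.5):
    nodes, kept = filter_edges(edges, threshold)
    print(f"threshold {threshold}: nodes={nodes}")
    for s, t, e in kept:
        print(f"  {s} -> {t}  width={edge_width(e, largest):.2f}")
```

Raising the threshold removes weak edges and any nodes left without connections, which is the behaviour an interactive slider would expose. The resulting edge list could then be passed to a tool such as Cytoscape or Gephi for layout.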

CONCLUSION

Building causal molecular networks is becoming increasingly important in biology. Inferring causality from entirely observational data is much less time consuming and less expensive than traditional randomised trials or intervention experiments. Additionally, the availability of genetic and multi‐omics data is increasing massively, making causal molecular network inference a very powerful approach. Here, we have reviewed the available approaches to building causal molecular networks. Traditional small‐scale MR approaches infer causality between exposures and outcomes. This makes MR a powerful tool when combined with other approaches to build large‐scale networks, but very limited when used on its own. Bayesian network methods, including the PC algorithm, are based on conditional independence properties and rarely scale well to large multi‐omics networks. Additionally, many of the methods based on Bayesian networks output a Markov equivalence class, which may lead to ambiguity between directed and undirected relationships. Combinations of approaches to inferring causal networks have attracted increasing attention because they bring together the advantages of the individual approaches, e.g. augmenting Bayesian networks with the principle of MR, as in MRPC (Badsha & Fu, 2019) and findr (Wang et al., 2019). This has allowed networks to be scaled to a much larger size, although the computational cost is still very high. These approaches have not yet been widely applied in the literature and there is still much to improve. Reducing the impact of unmeasured confounders and horizontal pleiotropy is important in any complex causal inference, which is why MR plays an important role in these approaches. These issues are being addressed with modern MR methods such as MR‐Egger (Bowden et al., 2015), CAUSE (Morrison et al., 2020) and Bayesian MR, and integrating these methods into combined approaches should be a focus in the future.
Selecting IVs is also a challenge for large‐scale causal networks. Linkage disequilibrium and pleiotropic effects can violate IV assumptions. Selecting strong IVs would potentially reduce data size, and thus computation time, and reduce bias. However, there is a trade‐off: restricting the analysis to strong IVs that together explain only a small proportion of the variation in the exposures may reduce the precision of the estimates. The future challenge is therefore to identify and select valid IVs that satisfy the assumptions and are optimal for large causal molecular networks, which may prove especially difficult as it is not known whether strong IVs exist for every phenotype.
Many causal molecular network methods have focussed on the use of individual‐level data, which can be difficult to obtain as they are usually not included in public databases for ethical reasons. Improving the available methods that can infer causality from widely available summary statistics should be a priority for researchers so that more can be done with current data. Improved methods, optimal interpretation and visualisation will advance understanding of disease processes. It is scientifically important, but computationally challenging, to take advantage of the increasing availability of multi‐omics data and to translate the findings directly into applications in the clinical treatment of disease. Given the complex biological structure of certain outcomes, the literature points to a need to develop more flexible and comprehensive approaches to building causal molecular networks.
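To make the notion of instrument strength concrete, the sketch below computes a per‐SNP first‐stage F statistic from a simple regression of the exposure on each SNP and keeps the SNPs above a cut‐off. The cut‐off of F > 10 is a widely used rule of thumb and is an assumption here, not a recommendation taken from this review. The SNP names, allele frequencies and effect sizes are simulated and hypothetical, and the check addresses only instrument strength, not pleiotropy or linkage disequilibrium.

```python
import random


def first_stage_f(snp, exposure):
    """F statistic for a simple regression of the exposure on one SNP."""
    n = len(snp)
    ms, me = sum(snp) / n, sum(exposure) / n
    sxy = sum((g - ms) * (x - me) for g, x in zip(snp, exposure))
    sxx = sum((g - ms) ** 2 for g in snp)
    syy = sum((x - me) ** 2 for x in exposure)
    r2 = sxy * sxy / (sxx * syy)          # variance explained by the SNP
    return r2 * (n - 2) / (1 - r2)


rng = random.Random(2)
n = 1000


def genotypes(maf):
    """Simulated allele counts (0, 1 or 2) for n individuals."""
    return [sum(rng.random() < maf for _ in range(2)) for _ in range(n)]


# One SNP with a real effect on the exposure and one with none (toy values).
snps = {"snp_strong": genotypes(0.3), "snp_null": genotypes(0.3)}
exposure = [0.5 * g + rng.gauss(0, 1) for g in snps["snp_strong"]]

f_stats = {name: first_stage_f(g, exposure) for name, g in snps.items()}
selected = [name for name, f in f_stats.items() if f > 10]  # assumed cut-off

for name, f in f_stats.items():
    print(f"{name}: F = {f:.1f}")
print("selected instruments:", selected)
```

Filtering on strength alone illustrates the trade‐off noted above: a stricter cut‐off yields fewer instruments, which may together explain less of the variation in the exposure.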

AUTHOR CONTRIBUTIONS

JK and HG undertook the literature search and co‐wrote the first draft and approved the final version. CB, BK and MT critically appraised the manuscript and approved the final version.

FUNDING INFORMATION

This work was jointly supported by the British Heart Foundation and The Alan Turing Institute (which receives core funding under the EPSRC grant EP/N510129/1) as part of the Cardiovascular Data Science Awards (Round 2, SP/19/10/34813).

CONFLICT OF INTEREST

The authors declare that they have no conflicts of interest.

65 in total

1. Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors: Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal: Genome Res Date: 2003-11 Impact factor: 9.043

2. In silico prediction of physical protein interactions and characterization of interactome orphans.

Authors: Max Kotlyar; Chiara Pastrello; Flavia Pivetta; Alessandra Lo Sardo; Christian Cumbaa; Han Li; Taline Naranian; Yun Niu; Zhiyong Ding; Fatemeh Vafaee; Fiona Broackes-Carter; Julia Petschnigg; Gordon B Mills; Andrea Jurisicova; Igor Stagljar; Roberta Maestro; Igor Jurisica
Journal: Nat Methods Date: 2014-11-17 Impact factor: 28.547

3. Generating a robust statistical causal structure over 13 cardiovascular disease risk factors using genomics data.

Authors: Azam Yazdani; Akram Yazdani; Ahmad Samiei; Eric Boerwinkle
Journal: J Biomed Inform Date: 2016-01-28 Impact factor: 6.317

4. MendelianRandomization: an R package for performing Mendelian randomization analyses using summarized data.

Authors: Olena O Yavorska; Stephen Burgess
Journal: Int J Epidemiol Date: 2017-12-01 Impact factor: 7.196

5. Mendelian randomization accounting for correlated and uncorrelated pleiotropic effects using genome-wide summary statistics.

Authors: Jean Morrison; Nicholas Knoblauch; Joseph H Marcus; Matthew Stephens; Xin He
Journal: Nat Genet Date: 2020-05-25 Impact factor: 38.330

6. Detecting and quantifying causal associations in large nonlinear time series datasets.

Authors: Jakob Runge; Peer Nowack; Marlene Kretschmer; Seth Flaxman; Dino Sejdinovic
Journal: Sci Adv Date: 2019-11-27 Impact factor: 14.136

7. Approximate Learning of High Dimensional Bayesian Network Structures via Pruning of Candidate Parent Sets.

Authors: Zhigao Guo; Anthony C Constantinou
Journal: Entropy (Basel) Date: 2020-10-10 Impact factor: 2.524

Review 8. Protein-protein interaction detection: methods and analysis.

Authors: V Srinivasa Rao; K Srinivas; G N Sujini; G N Sunand Kumar
Journal: Int J Proteomics Date: 2014-02-17

9. Integrative genomics reveals novel molecular pathways and gene networks for coronary artery disease.

Authors: Ville-Petteri Mäkinen; Mete Civelek; Qingying Meng; Bin Zhang; Jun Zhu; Candace Levian; Tianxiao Huan; Ayellet V Segrè; Sujoy Ghosh; Juan Vivar; Majid Nikpay; Alexandre F R Stewart; Christopher P Nelson; Christina Willenborg; Jeanette Erdmann; Stefan Blakenberg; Christopher J O'Donnell; Winfried März; Reijo Laaksonen; Stephen E Epstein; Sekar Kathiresan; Svati H Shah; Stanley L Hazen; Muredach P Reilly; Aldons J Lusis; Nilesh J Samani; Heribert Schunkert; Thomas Quertermous; Ruth McPherson; Xia Yang; Themistocles L Assimes
Journal: PLoS Genet Date: 2014-07-17 Impact factor: 5.917

10. CaNDis: a web server for investigation of causal relationships between diseases, drugs and drug targets.

Authors: Blaž Škrlj; Nika Eržen; Nada Lavrač; Tanja Kunej; Janez Konc
Journal: Bioinformatics Date: 2021-05-05 Impact factor: 6.937

1 in total

Review 1. A review of causal discovery methods for molecular network analysis.

Authors: Jack Kelly; Carlo Berzuini; Bernard Keavney; Maciej Tomaszewski; Hui Guo
Journal: Mol Genet Genomic Med Date: 2022-09-10 Impact factor: 2.473
