
High-order dynamic Bayesian Network learning with hidden common causes for causal gene regulatory network.

Leung-Yau Lo, Man-Leung Wong, Kin-Hong Lee, Kwong-Sak Leung.

Abstract

BACKGROUND: Inferring the gene regulatory network (GRN) has been an important topic in bioinformatics. Many computational methods infer the GRN from high-throughput expression data. Due to the presence of time delays in the regulatory relationships, the High-Order Dynamic Bayesian Network (HO-DBN) is a good model of the GRN. However, previous GRN inference methods assume causal sufficiency, i.e. no unobserved common causes. This assumption is convenient but unrealistic, because it is possible that relevant factors have not even been conceived of and are therefore unmeasured. An inference method that also handles hidden common cause(s) is thus highly desirable. Moreover, previous methods for discovering hidden common causes either do not handle multi-step time delays or restrict the hidden common causes to have no observed genes as parents.
RESULTS: We have developed a discrete HO-DBN learning algorithm that can also infer hidden common cause(s) from discrete time series expression data, under some assumptions on the conditional distribution, but is less restrictive than previous methods. We assume that each hidden variable has only observed variables as children and parents, with at least two children and possibly no parents. We also make the simplifying assumption that the children of hidden variable(s) are not linked to each other. Moreover, our proposed algorithm can utilize multiple short time series (not necessarily of the same length), as long time series are difficult to obtain.
CONCLUSIONS: We have performed extensive experiments using synthetic data on GRNs of size up to 100, with up to 10 hidden nodes. Experimental results show that our proposed algorithm can recover the causal GRNs adequately given the incomplete data. Using the limited real expression data and small subnetworks of the YEASTRACT network, we have also demonstrated the potential of our algorithm on real data, though more time series expression data is needed.


Year:  2015        PMID: 26608050      PMCID: PMC4659244          DOI: 10.1186/s12859-015-0823-6

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

Inferring the gene regulatory network (GRN) has been an important topic in bioinformatics, owing to the important role the GRN plays in the functioning of the cell. In the cell, genes are transcribed and subsequently translated into proteins, some of which are transcription factors (TFs) that trigger or inhibit the transcription of other gene(s). Transcription and translation, however, take time and may have delays due to other reasons [1-4]. These delays have been known to affect network stability or cause oscillations [5-8]. Therefore, the GRN can be modeled as a directed network where each directed link, representing the regulation of a target gene by a gene, is labeled with the delay. Rather than experimentally determining the regulatory targets of each transcription factor in an expensive and time-consuming way, many computational methods attempt to infer the GRN from high-throughput microarray or RNA-seq gene expression data, which can measure the expression of thousands of genes at the same time, and allow time series expression data to be obtained when this is done for a number of time points. However, to our knowledge, previous GRN inference methods all implicitly make the assumption of causal sufficiency, i.e. that there are no unobserved common causes, which is convenient but unrealistic. For example, miRNAs were previously not thought to play important roles in gene regulation. It is in principle impossible to be certain that all relevant factors or common causes have been measured, because factors that are not even conceived of cannot be measured. Therefore an inference method that also handles hidden common cause(s) is highly desirable.

Gene network inference

Many GRN inference algorithms and models have been proposed, with different levels of detail obtained by labeling the edges with different information; see [9, 10] for surveys of GRN modelling and [11] for a survey of GRN inference algorithms for microarray expression data. Some methods infer only an undirected network, for example Relevance network [12], ARACNE [13] and C3NET [14], all of which use mutual information as a measure of association between genes. Some infer a directed network, but without labeling the edges with time delays, e.g. [15]. Static Bayesian Network (BN) is sometimes used to model the GRN, e.g. in [16], but the downside of static BN is that no cycles are allowed and no time delays are taken into account. Some algorithms consider delays of only one time step, e.g. [17], which uses association rule mining; [18], which uses the classical Boolean network; [19], which uses a Gaussian process for Bayesian inference of an Ordinary Differential Equation (ODE) model discretized in time; and DELDBN [20], which combines an ODE model with local Bayesian analysis. In contrast to static BN, the dynamic Bayesian Network (DBN) models the joint distribution of some random variables at different time points, and allows time delays. First-Order DBN (FO-DBN) allows time delays of only one time step, e.g. [21] and GlobalMIT [22]. On the other hand, High-Order DBN (HO-DBN) allows delays of multiple time steps. Many HO-DBN models are discrete, ignore intra-slice edges (i.e. no instantaneous effects), and are learnt by optimizing a score on candidate structures. Since BN learning is NP-hard [23], some DBN learning algorithms use heuristic or stochastic optimization, such as simulated annealing, as in Banjo [24] (whose updated version allows multi-step delays); genetic algorithms, as in [25]; and greedy heuristic search, as in MMHO-DBN [26] (used when the number of parents is large; exhaustive search is used otherwise).
After [27] had shown that it is possible to globally optimize the Minimum Description Length (MDL) score [28] and the Bayesian-Dirichlet equivalent (BDe) score [29] in polynomial time for certain BN and DBN models, the technique was adapted to globally optimize the MIT score [30] in GlobalMIT for FO-DBN and in GlobalMIT+ [31] for HO-DBN. GlobalMIT* is a heuristic and faster version of GlobalMIT+. Although for small or medium sized networks GlobalMIT+ and GlobalMIT* can globally optimize the score in reasonable time, when the number of genes and time points increases, the practical time needed can still be long, as the order of the polynomial depends on the number of time points. Therefore, a good heuristic HO-DBN learning method that strikes a balance between effectiveness and efficiency is still a useful complement. On the continuous side, there are not many algorithms that handle multi-step delays; examples are TD-ARACNE [32], which is an extension of ARACNE; DD-lasso [33], which uses lasso [34]; and CLINDE [35], which extends the PC algorithm [36] with time delays.

Hidden common cause

The above methods ignore the issue of hidden common cause(s) by (implicitly) assuming causal sufficiency, i.e. all common causes have been observed. Inferring hidden common cause(s) is an important topic in causality inference, because ignoring them may result in misleading causal relationships, as illustrated in Fig. 1, where some nodes are wrongly thought to be causally linked.
Fig. 1

Illustration of possible misleading causal relationships if hidden common cause is ignored. The numbers are the delays. The grey circle is the hidden common cause. Since the children and parents of the hidden common cause are associated, they may be mistakenly thought to be directly linked

An early work is [37], which formulates the problem as determining the constraints on the variance-covariance matrix of the observed data, and then searching for a causal structure to explain the constraints. Some works assume the presence of hidden common causes of observed variables, but only indicate that some variables may have a hidden common cause, and focus on the relationships between observed variables instead. The Causal Inference (CI) and Fast Causal Inference (FCI) algorithms [36] are extensions of the PC algorithm to handle the causally insufficient case; similarly, the IC* algorithm [38] is an extension of the IC algorithm. CI, FCI and IC* all give only a partially oriented graph, where some edges may remain undirected, and some are labeled to mean that the two genes may have a hidden common cause. Eichler [39] is a Granger-causality based method that learns a mixed graph from time series data, where directed edges represent direct causal relationships, and dashed edges represent relationships due to hidden common causes. Pellet and Elisseeff [40] is an extension of the FCI algorithm and does not use time series data. A stochastic differential equation model (discretized in time) is used in [41], where hidden variables are assumed only to more accurately estimate the relationships between observed variables, using a convex optimization based method.
In [42], a Satisfiability problem is formulated from the d-separation and d-connection information provided by conditional independence tests, which is then incrementally solved to attempt to recover the dependency structure between observed variables; some variables may be indicated to have latent variables, and some edges may be marked as “unknown” if the given information is insufficient for determining whether they are present or not. While the above works do not have any hidden common cause in the output, some works do label predicted hidden common cause(s), but any hidden common cause can only have other hidden variables, not observed variables, as parents. Silva [43, 44] are examples in this direction, where observed variables depend linearly on their parents (either hidden or observed), and hidden variables can depend nonlinearly on their parents (only hidden variables). In [45], a linear Bayesian Network is learnt, but it is assumed that there are no edges among the observed variables, and that the hidden variables are linearly independent. Some works are less restrictive and allow the hidden variables to have observed variables as parents. Boyen et al. [46] use a FO-DBN model, based on the observation that ignoring a hidden variable in a DBN usually results in violation of the Markov property. The algorithm therefore tries to find non-Markovian correlations (those across more than one time step) and to explain them by introducing hidden variables. This work, however, evaluates the likelihood on the testing set rather than how close the resulting dependency structure is to the assumed true causal structure. In [47], a discrete BN with hidden variables but without time delays is learnt, where hidden variables are assumed to have observed variables as parents and children. It is closer to the work in this paper in that it places fewer restrictions on the parents of the hidden common cause(s) than the previously mentioned methods, but it does not handle time delays.
It is motivated by the observation that the inferred dependency structure of the observed variables (the parents and children of the hidden variable) will usually be overly complicated, with many connections, when the hidden variable is not taken into account (see Fig. 1). Therefore, they guess the position of the hidden variable(s) and its local structure by finding semi-clique(s). A semi-clique is a subset of nodes where each node in the subset is connected to more than half of the nodes in the subset. After that, adjustments are made by explicitly linking the variables of the semi-clique with the introduced hidden variable, and this local structure is then fine-tuned by Structural-EM (SEM) [48]. This work also focuses on the likelihood in the evaluation instead of the inferred structure. The use of the semi-clique as a structural signature also places some restrictions on the subnetwork surrounding the hidden variable(s): e.g. a hidden variable must have parent(s), which are observed variable(s), and the total number of parents and children of a hidden variable must be at least three, because the smallest semi-clique has size three. Elidan and Friedman [49] complement [47] and focus on learning the dimensionality (the number of states) of hidden variables. In short, HO-DBN learning methods that can handle multi-step time delays, such as GlobalMIT*, do not handle hidden common cause(s); hidden common cause learning algorithms do not handle multi-step time delays; and those without time delays are restricted to acyclic networks. In other words, to our knowledge, no previous method handles multi-step time delays and learns hidden common cause(s) at the same time.

Objective

In this paper, we want to develop a HO-DBN learning algorithm that can also infer hidden common cause(s) from discrete time series expression data, under some assumptions on the conditional distribution, but with less restrictive assumptions than the above mentioned methods. Here, we focus on the discrete case, because combinatorial regulation can be easily modeled by a discrete HO-DBN. We assume that there is a d-th order (the maximum delay is d) stationary HO-DBN that is of interest, where a small but unknown number of common cause(s) are not observed. Each hidden variable has only observed variables as children and parents, with at least two children and possibly no parents. We also make the simplifying assumption that the children of unobserved variable(s) are not linked to each other, because it is difficult to differentiate whether the high association between two children is solely due to the hidden common cause, or due to both the hidden common cause and a direct link between the two children. As the prior network is difficult to learn, and the transition network is of main interest, our objective is to infer the transition network of the HO-DBN from the discrete time series of the observed variables. Moreover, it is desirable that the algorithm be capable of utilizing multiple short time series (not necessarily of the same length), as long time series are difficult to obtain. To our knowledge, previous works either make much more restrictive assumptions or ignore time delays.

Methods

The motivating idea of our proposed method is that when a common cause is hidden, the parents of its children will not be determined correctly, and will probably be wrongly predicted to be the parent(s) of the hidden cause, or other children of the hidden cause, as illustrated in Fig. 1. In order to remedy this, the “anomaly” has to be recognized first. The overall strategy is to first learn an initial GRN while ignoring possible hidden common causes, then to detect the presence of hidden common cause(s), and to estimate any detected hidden common cause. This overall strategy is similar to that of [47]. But while [47] uses the semi-clique as a structural signature, and [46] uses correlation across more than one time step as the clue of an “anomaly”, in this paper we propose to make assumptions on the conditional distribution for this purpose. The idea is that when the parents of a gene are wrongly determined, the conditional distribution will be different from expected, and we can predict the genes with a hidden common cause using this clue. After that, by the fact that genes with a common parent should have high association, the suspected genes can be clustered, and one hidden common cause can be estimated for each cluster with size at least two. Unlike [47], we estimate hidden common cause(s) with simple EM, instead of Structural-EM. The overall flow of the proposed method is given in Fig. 2. The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the estimated conditional distributions, 3) estimate the hidden common cause(s), 4) re-learn the GRN after estimation of the hidden common cause(s). The program code can be obtained at https://github.com/peter19852001/hcc_dclinde. We first describe the data and model assumed in this paper, then describe each step in more detail; we first describe the case with one time series, and later describe the case of multiple time series in a separate subsection.
Fig. 2

Overall Flow of the Proposed Algorithm. The steps are: 1) infer an initial GRN, 2) identify the genes with hidden common cause, 3) estimate the hidden common cause(s), which involves clustering and EM, 4) re-learn the GRN after estimation of the hidden common cause(s)


Model and data

The GRN model used in this paper is the High-Order Dynamic Bayesian Network (HO-DBN) on n+n_h random variables X_i^t at time points t=1,…,T. Each X_i^t represents the expression of gene i at time t, where n of the genes are observed, and n_h are hidden. We assume that the DBN satisfies the d-th order Markov property, i.e. P(X^t | X^{t-1},…,X^1) = P(X^t | X^{t-1},…,X^{t-d}), where the order d>1 is the maximum delay. We also assume that the DBN is stationary, i.e. the dependency P(X^t | X^{t-1},…,X^{t-d}) is independent of t. Therefore, the joint distribution can be factorized as P(X^1,…,X^d) multiplied by the product over t>d of P(X^t | X^{t-1},…,X^{t-d}). Here P(X^1,…,X^d) is the prior network, which needs many independent samples to estimate, so the focus is usually on the transition network P(X^t | X^{t-1},…,X^{t-d}). Assuming a stationary DBN, the transition network can be represented as a multi-graph of n+n_h nodes (representing the n+n_h genes), where an edge i→j is labeled with a delay τ_ij ≥ 0, meaning that X_j^t depends on X_i^{t-τ_ij}. Note that there may be multiple edges from node i to node j, but with different delays. We make the following assumptions on the structure of the transition network: 1) there are n observed genes and n_h hidden variable(s), where n_h is unknown but small relative to n; 2) the subgraph for which τ_ij = 0 (the instantaneous effects or intra-slice edges) is empty, i.e. τ_ij > 0; this is commonly assumed, e.g. in [25], MMHO-DBN [26] and GlobalMIT+ [31], and is quite reasonable in GRN modeling, as the regulatory relationships invariably have delays; 3) each hidden variable has at least two observed genes as children; 4) if a gene has a hidden parent, it has no other parents; 5) children with the same hidden parent are not linked with each other; 6) for each conditional distribution with n_s states, one of the states has probability p_b (the bias), and the other states each has probability (1−p_b)/(n_s−1); that is, the dependency of a gene on its parent(s) is mostly a function, with some “noise” added, and a higher value of p_b means a lower “noise” level. The given data consists of K discrete time series with equidistant time points, of lengths T_k for 1≤k≤K.
The K time series should be discretized in the same way, so that the states (e.g. 0, 1 and 2 for low, average and high expression) are consistent in different time series.
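As a concrete illustration of the assumed conditional distribution (a sketch only; sample_child, state_fn and p_bias are hypothetical names, not part of the paper's code), a child gene takes the "functional" state determined by its parents' configuration with probability p_b, and any of the other states uniformly otherwise:

```python
import random

def sample_child(state_fn, parent_config, n_states, p_bias, rng=random):
    """Sample a child gene's state given its parents' configuration:
    with probability p_bias the child takes the 'functional' state
    state_fn(parent_config); otherwise one of the remaining
    n_states - 1 states, uniformly at random."""
    functional = state_fn(parent_config)
    if rng.random() < p_bias:
        return functional
    return rng.choice([s for s in range(n_states) if s != functional])

# A 3-state gene that copies its single parent, with bias p_b = 0.9
rng = random.Random(0)
samples = [sample_child(lambda q: q[0], (2,), 3, 0.9, rng) for _ in range(1000)]
frac_functional = samples.count(2) / len(samples)
```

With a large sample, the fraction of "functional" states should be close to p_b, which is exactly the quantity the bias estimation step later exploits.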

Initial GRN

For the purpose of identifying genes with hidden common cause(s), the first step is to obtain an initial GRN. In principle, any HO-DBN learning algorithm could be used. In our preliminary test (unpublished), we adapted CLINDE [35] to discrete data to give D-CLINDE, which is a constraint-based method extending the PC algorithm [36] (in a similar way to CLINDE). We have shown that for large numbers of genes and time points, D-CLINDE can be orders of magnitude faster than MMHO-DBN and GlobalMIT* (and consequently GlobalMIT+), while achieving slightly better learning performance than MMHO-DBN, and 70–80 % of the learning performance of GlobalMIT*. Therefore, D-CLINDE is a good complement to GlobalMIT* (and GlobalMIT+) for large networks and many time points, when even GlobalMIT* would be too slow. Also, D-CLINDE, GlobalMIT* and GlobalMIT+ can all handle multiple time series, while MMHO-DBN cannot. Therefore, in our current program, we allow the use of GlobalMIT*, GlobalMIT+ or D-CLINDE for inferring the initial GRN. We briefly describe D-CLINDE, GlobalMIT+ and GlobalMIT* in the following.

D-CLINDE

D-CLINDE is a constraint-based method, where conditional independence tests on the data constrain the possible causal structure. It consists of two stages. In stage 1, independence tests (G^2 tests) are conducted on all gene pairs i→j for all possible delays up to a maximum delay. If the null hypothesis of the independence test is rejected (the score of the test is −log10(p-value), and the null hypothesis is rejected if the score is larger than a score threshold), the link with the associated delay is kept for stage 2. The default value for the score threshold is 2, corresponding to a p-value threshold of 0.01. Note that there may be multiple delays for i→j after stage 1. Stage 2 attempts to eliminate indirect effects, based on the fact that if x and y are conditionally independent given a set of variable(s) Z (containing neither x nor y), then there should not be a link between x and y. So in stage 2, we iteratively condition on h=1 neighbor for each link to see if the link can be pruned, then condition on h=2 neighbors for any remaining links, and so on up to h=N_0 for a given parameter N_0, with a default value of 2. When performing a conditional test, the neighbors to be conditioned on are shifted using the delays estimated in stage 1, and if the null hypothesis is not rejected (based on the score and score threshold, similar to stage 1), the link is pruned.
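The stage-1 delayed pairwise test can be sketched as follows (g2_score is a hypothetical helper, not the paper's code; we use SciPy's log-likelihood-ratio option of chi2_contingency to obtain the G^2 statistic):

```python
import numpy as np
from scipy.stats import chi2_contingency

def g2_score(x, y, delay):
    """Score the delayed link x -> y: pair x_t with y_{t+delay}, build the
    contingency table of the overlapping pairs, and return -log10(p-value)
    of the G^2 (log-likelihood ratio) independence test."""
    xs, ys = x[:len(x) - delay], y[delay:]
    sx, sy = np.unique(xs), np.unique(ys)
    table = np.zeros((len(sx), len(sy)))
    for a, b in zip(xs, ys):
        table[np.searchsorted(sx, a), np.searchsorted(sy, b)] += 1
    _, p, _, _ = chi2_contingency(table, lambda_="log-likelihood")
    return -np.log10(p)

# Toy check: y is a noisy copy of x delayed by 2 time steps
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=200)
y = np.roll(x, 2)                               # y_t = x_{t-2}
y = np.where(rng.random(200) < 0.1, 1 - y, y)   # 10 % flips as noise
score_true = g2_score(x, y, delay=2)
score_wrong = g2_score(x, y, delay=1)
```

With the default score threshold of 2, the link at the correct delay would be kept, while at the wrong delay the series are effectively independent and the score is typically far lower.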

GlobalMIT+ and GlobalMIT*

GlobalMIT+ [31] is a method that globally maximizes the MIT score [30], which measures the mutual information between a gene and its parents (with delays), together with a penalty on the number of parents. The basic idea is that for each gene, we enumerate the parents (with the delays), starting from zero parents, then one parent, then two and so on, but with pruning to avoid having to enumerate all possible subsets. The characteristics of the score (to be minimized) that allow effective optimization (in polynomial time) and pruning are: 1) no need to check acyclicity, which allows the score to be calculated separately for each variable; since GlobalMIT+ ignores instantaneous effects, the network is always acyclic; 2) additivity: the score of a candidate network can be decomposed into the sum of the scores of the genes, which greatly simplifies the search and allows easy parallelization; 3) splitting: the score for each gene can be decomposed into a sum of complexity and accuracy parts as s(Pa) = u(Pa) + v(Pa), where u(.) ≥ 0 and v(.) ≥ 0, and the complexity part is non-decreasing: u(Pa_1) ≤ u(Pa_2) for Pa_1 ⊆ Pa_2; 4) uniformity: the complexity is only a function of the number of parents, i.e. u(Pa_1) = u(Pa_2) whenever |Pa_1| = |Pa_2|. In minimizing the score, if the complexity alone exceeds the best score so far, it is safe to prune the search, as adding more parents could only worsen the score. The key to the proof of polynomial time is a logarithmic bound p* on the number of parents to consider (e.g. a p* such that the complexity alone of any parent set with |Pa| = p* already exceeds the score of the empty parent set), so that only polynomially many sets of parents with delays need to be considered, each of which can be scored in polynomial time, making the whole global search polynomial.
In the case of GlobalMIT+, with a simple trick the maximization is turned into minimization, and by assuming that all variables have the same number of states k (for uniformity), all of the above conditions are satisfied, so the MIT score can be optimized in polynomial time with p* approximately log_k(N), where N is the number of data observations. However, the order of the polynomial depends on the number of time points, and for large networks and a large number of time points, the practical running time can still be long. Recognizing this shortcoming, GlobalMIT* is a heuristic and faster version of GlobalMIT+ with the additional assumption that for each pair of genes i→j there is only one delay, namely the delay with the best MIT score. So GlobalMIT* first finds the best delay individually for each pair i→j, and need not try the other delays in the subsequent optimization. This substantially reduces the search space and greatly speeds up the search. However, in our preliminary test, the practical running time can still be long for a large number of genes and time points.
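The splitting-plus-uniformity pruning principle can be illustrated with a toy decomposable score (best_parents and the linear penalty below are illustrative stand-ins under assumed conditions, not the actual MIT score):

```python
from itertools import combinations

def best_parents(v, n_genes, target, penalty, max_parents):
    """Toy illustration of the pruning principle used by GlobalMIT+:
    total score s(Pa) = u(Pa) + v(Pa), with complexity u(Pa) = penalty*|Pa|
    (non-decreasing, and a function of |Pa| only) and accuracy v(Pa) >= 0.
    Once penalty*p alone reaches the best total found so far, no parent
    set of size >= p can improve it, so the search stops early."""
    best, best_pa = v(()), ()          # start from the empty parent set
    for p in range(1, max_parents + 1):
        if penalty * p >= best:        # complexity alone already too high
            break
        for pa in combinations([g for g in range(n_genes) if g != target], p):
            total = penalty * p + v(pa)
            if total < best:
                best, best_pa = total, pa
    return best_pa, best

# Hypothetical accuracy term, minimized when genes 1 and 3 are the parents
def acc(pa):
    return 10 - 4 * (1 in pa) - 4 * (3 in pa)

pa, s = best_parents(acc, n_genes=6, target=0, penalty=1, max_parents=5)
```

Here the search never evaluates parent sets of size 4 or 5: once the best total is 4, the complexity of 4 parents alone already matches it, so the loop breaks.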

Identification of genes with hidden common cause

Having obtained the initial GRN of the observed genes, we can estimate the conditional distribution of each gene by maximum likelihood, and then estimate the bias of each gene, to compare with the expected bias. In this paper, we use a simple method to estimate the bias. For each gene g and each configuration Q of its parent(s) Pa_g, we calculate the maximum probability of the conditional distribution, max_j P(g=j | Pa_g=Q), and we use the median of these maximum probabilities over the parent configurations Q as the estimate of the bias for gene g. For each gene, we compare the estimated bias with the expected bias p_b: if they differ by more than ρ, we predict the gene to have a hidden common cause, where ρ is the tolerance with a default value of 0.05. The idea is that if a gene has no hidden common cause, we expect its parents (and delays) to be correctly determined (given sufficient data), so the estimated bias should be close to the expected bias. On the other hand, if a gene has a hidden common cause, its true parents cannot be determined correctly, and we expect the estimated bias to be different from the expected bias. Those genes determined to have hidden parents are called candidates. If the number of observed genes n is small, we assume that the expected bias is known and given. On the other hand, when n is larger, by the assumption that there are only a small number of hidden variables, we can attempt to estimate the expected bias from the estimated biases of the observed genes. In this study, we simply use the median of the estimated biases as the expected bias, if it is not given. We discuss a possible alternative strategy for estimating the expected bias as future work in the conclusions.
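A minimal sketch of the bias estimation and candidate test described above (estimate_bias and has_hidden_parent are hypothetical names; data is assumed to hold already delay-aligned samples of the parents and child):

```python
import numpy as np

def estimate_bias(child, parents, data, n_states):
    """Estimate the bias of a gene: for each parent configuration Q seen
    in the (already delay-aligned) data, take max_j P(child=j | Q) from
    the empirical conditional distribution, then return the median of
    these maxima over the configurations."""
    by_config = {}
    for row in data:
        q = tuple(row[p] for p in parents)
        by_config.setdefault(q, []).append(row[child])
    maxima = [np.bincount(vals, minlength=n_states).max() / len(vals)
              for vals in by_config.values()]
    return float(np.median(maxima))

def has_hidden_parent(est_bias, expected_bias, rho=0.05):
    """Candidate test: flag the gene if the estimated bias deviates from
    the expected bias by more than the tolerance rho."""
    return abs(est_bias - expected_bias) > rho

# Hypothetical 3-state gene copying its single parent with bias 0.9
rng = np.random.default_rng(1)
par = rng.integers(0, 3, size=2000)
noisy = (par + 1 + rng.integers(0, 2, size=2000)) % 3   # some state != par
child = np.where(rng.random(2000) < 0.9, par, noisy)
data = np.stack([par, child], axis=1)
b = estimate_bias(child=1, parents=[0], data=data, n_states=3)
```

When the true parent is used, the estimated bias is close to the generating bias, so the gene is not flagged; if the wrong parents were conditioned on, the conditional distributions would flatten and the estimate would drop.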

Estimation of hidden common cause(s)

Clustering the candidates

We simply output the initial GRN as the final GRN if there are no candidates. Otherwise, based on the fact that genes with a common parent are associated, we cluster the candidates to determine which genes have a common parent, and also to estimate their relative delays for estimating the hidden common cause(s). Although there are many different clustering algorithms, we found from our preliminary tests that even a simple greedy clustering algorithm works adequately. The idea is that we consider each candidate in turn and find the cluster center most similar to it; if it is close enough, the candidate is added to that cluster, otherwise it forms a new cluster. The steps are: 1) let the k candidates be {g_1, g_2, …, g_k}; 2) set n_c ← 1, c_1 ← g_1, τ_1 ← 0, C_1 ← {g_1}; 3) for i = 2, …, k, find the center c_j maximizing d(g_i, c_j), and set τ_i to be the associated time shift of g_i relative to c_j; if d(g_i, c_j) ≥ S_0, update C_j ← C_j ∪ {g_i}, otherwise set n_c ← n_c + 1, c_{n_c} ← g_i, C_{n_c} ← {g_i}, τ_i ← 0; 4) output the n_c clusters {C_j : 1 ≤ j ≤ n_c} and the time shifts {τ_i : 1 ≤ i ≤ k}. Here c_j is the center of cluster j and C_j is cluster j; τ_i is the time shift of candidate g_i relative to its cluster center; d(x, y) measures the similarity of two time series x and y, for which we use the maximum −log10(p-value) of G^2 tests of the shifted time series (shifting y relative to x, from −d to d, where d is the maximum delay); and S_0 is the threshold for a series to be included in a cluster, with a default value of 2.3 (from our preliminary tests this value seems to work well, although a value of 1.3 also works adequately).
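The greedy clustering steps above can be sketched as follows (greedy_cluster is a hypothetical helper; similarity(a, b) is assumed to return a (score, best shift) pair, e.g. from shifted G^2 tests):

```python
def greedy_cluster(candidates, similarity, threshold):
    """Greedy clustering of candidate genes: each candidate joins the
    most similar existing cluster center if the similarity score reaches
    the threshold S_0, otherwise it starts a new cluster.  Returns the
    clusters and each candidate's time shift relative to its center."""
    centers = [candidates[0]]
    clusters = [[candidates[0]]]
    shifts = {candidates[0]: 0}
    for g in candidates[1:]:
        scored = []
        for j, c in enumerate(centers):
            score, shift = similarity(g, c)
            scored.append((score, shift, j))
        score, shift, j = max(scored)       # most similar existing center
        if score >= threshold:
            clusters[j].append(g)
            shifts[g] = shift
        else:                               # start a new cluster
            centers.append(g)
            clusters.append([g])
            shifts[g] = 0
    return clusters, shifts

# Hypothetical similarity: series sharing a leading letter are "associated"
def sim(a, b):
    return (3.0 if a[0] == b[0] else 0.5, 1)

clusters, shifts = greedy_cluster(["a1", "a2", "b1", "b2"], sim, threshold=2.3)
```

With the default threshold S_0 = 2.3, the two "a" series end up in one cluster and the two "b" series in another, each later yielding one estimated hidden common cause.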

Estimating the hidden common cause by expectation maximization

After the clustering, we estimate a hidden common cause (by estimating its time series) for each cluster with two or more members. If no cluster has size at least two, we simply output the initial GRN as the final GRN. For each cluster with size at least two, we perform up to two rounds of EM. The first round estimates a hidden common cause (as parent) of the genes in the cluster without considering potential parents of the hidden common cause. The second round uses the estimated time series of the hidden common cause to find potential parents from all observed genes (not limited to the cluster under consideration) by picking those with high association with the estimated hidden common cause, and re-estimates the hidden common cause treating the found potential parents (if any) as parents of the hidden common cause. Note that the identified potential parents of a hidden common cause may not be its true parents, as they are found by considering only pairwise associations and not possible indirect effects. So we still rely on the re-learning of the GRN after estimating the hidden common cause(s) to more accurately identify their parents, if any. However, we expect the identified potential parents to contain useful information for the estimation of the hidden common cause. We use simple Expectation Maximization (EM) [50] to optimize the log-likelihood, where the states of the hidden common cause at the time points are the latent variables. Let the hidden common cause to be estimated be h. The number of states of h is either given as a parameter, or taken as the maximum of the numbers of states of the children if not given. We perform two rounds of EM, each with a default of 100 iterations, and with restarts. Below we briefly describe the EM steps.
Suppose for cluster C = {g_1, g_2, …, g_|C|} with |C| > 1 that we want to estimate a hidden common cause h with n_h states, which may have potential parents identified (for the second round). We first note that the different series may not be aligned because of the different time shifts, as illustrated in Fig. 3. Suppose the time points of interest are t_s ≤ t ≤ t_e; we denote the state of h at time t as h_t, and these are the latent variables in the EM. Let the configuration of the potential parents of h be denoted by Q, the value of Q at time t by Q_t, and the value of g_i at time t (if available) by x_it. Our goal is to estimate the most probable h_t for t_s ≤ t ≤ t_e given D = {Q_t} ∪ {x_it}. The parameters of the likelihood are θ = {P(h|Q)} ∪ {P(g_i|h)}, where P(h|Q) becomes P(h) if h has no potential parents.
Fig. 3

Illustration of un-aligned series for estimating hidden common cause

We first randomly initialize the parameters θ^(0) = {P^(0)(h|Q)} ∪ {P^(0)(g_i|h)}, then repeat the E-step and the M-step for a default of 100 iterations. In the E-step at iteration k, for each time t and each state j of h, we compute the posterior probability of h_t = j given the data D and the current parameters θ^(k); in the M-step, we update the parameters for the next iteration from the resulting expected counts. After the iterations, we output the h_t's of the iteration for which the likelihood L(θ^(k)) is maximum as the estimate of the most probable h_t for this round of EM. After the first round, we use the estimated most probable h_t to find potential parents of h, by performing G^2 tests with all observed genes at different time shifts, using a score of −log10(p-value). A gene (with a particular time shift) could be a potential parent of h if the score is at least 2, and we take only the 3 potential parents with the highest scores if there are more than 3. If any potential parent is found, we perform the second round of EM with the parents properly shifted to re-estimate the most probable h_t. Lastly, we take the suffix of the estimated h_t and shift it in time so that it precedes all the genes in C, and use it as the estimate of the hidden common cause of cluster C.
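A simplified first-round EM (no potential parents of h, so P(h|Q) reduces to P(h)) can be sketched as follows; this is a sketch with hypothetical names, without the restarts and likelihood bookkeeping of the full method, and it assumes the children's series are already aligned:

```python
import numpy as np

def em_hidden_cause(X, n_h, n_iters=100, seed=0):
    """Simplified first-round EM: X is a (T, C) array of the aligned
    children's discrete states; the latent variables are the hidden
    states h_t.  Parameters are P(h) and P(g_i | h); returns the most
    probable h_t sequence and the posteriors gamma."""
    rng = np.random.default_rng(seed)
    T, C = X.shape
    n_states = int(X.max()) + 1
    prior = np.full(n_h, 1.0 / n_h)                         # P(h)
    cond = rng.dirichlet(np.ones(n_states), size=(C, n_h))  # P(g_i | h)
    for _ in range(n_iters):
        # E-step: gamma[t, j] proportional to P(h_t=j) * prod_i P(x_it | h_t=j)
        log_g = np.log(prior) + sum(np.log(cond[i][:, X[:, i]]).T
                                    for i in range(C))
        log_g -= log_g.max(axis=1, keepdims=True)
        gamma = np.exp(log_g)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate P(h) and each P(g_i | h) from expected counts
        prior = gamma.mean(axis=0) + 1e-9
        prior /= prior.sum()
        for i in range(C):
            for v in range(n_states):
                cond[i][:, v] = gamma[X[:, i] == v].sum(axis=0)
            cond[i] += 1e-6                 # smoothing against empty states
            cond[i] /= cond[i].sum(axis=1, keepdims=True)
    return gamma.argmax(axis=1), gamma

# Toy check: 3 children that are noisy copies (10 % flips) of a hidden
# binary cause h; EM should recover h up to label permutation.
rng = np.random.default_rng(7)
T = 300
h = rng.integers(0, 2, size=T)
X = np.stack([np.where(rng.random(T) < 0.1, 1 - h, h) for _ in range(3)],
             axis=1)
est, _ = em_hidden_cause(X, n_h=2)
agree = max(np.mean(est == h), np.mean(est == 1 - h))
```

Because the states of h are only identifiable up to a permutation, the recovered sequence is compared against h under both labelings.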

Re-learn the GRN after estimation of hidden common cause(s)

If there are no estimated hidden common cause(s), we simply output the initial GRN as the final GRN. Otherwise we re-learn the GRN using the original observed expression data together with the estimated hidden time series of the common cause(s) to give the final GRN, but we disallow any links between the candidates in the same cluster. Similar to inferring the initial GRN, either GlobalMIT*, GlobalMIT+ or D-CLINDE can be used (chosen independently of the choice for the initial GRN).

Handling multiple time series data

The above describes the steps of the proposed algorithm when a single time series is provided; we now describe the case where multiple time series are provided, not necessarily of the same length. The main idea is that when shifting the time series by a delay (e.g. for the G^2 test), all the time series are shifted, and the overlapping parts are concatenated for the calculation. This is illustrated in Fig. 4.
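The shift-and-concatenate idea can be sketched as follows; `shifted_pairs` is a hypothetical helper assuming each series is an array with a parent column and a child column.

```python
import numpy as np

def shifted_pairs(series_list, delay):
    """Collect (parent, child) value pairs across multiple time series.

    For each series s (a 2-column array [parent, child] over time), pair
    the parent at time t with the child at time t + delay, keep only the
    overlapping part, then concatenate the pairs from all series.
    """
    parents, children = [], []
    for s in series_list:
        if len(s) <= delay:
            continue                      # series too short for this delay
        parents.append(s[:len(s) - delay, 0])
        children.append(s[delay:, 1])
    # assumes at least one series is long enough for the delay
    return np.concatenate(parents), np.concatenate(children)
```

The concatenated pairs can then be fed to any pairwise association test, so shorter series simply contribute fewer pairs rather than being discarded.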
Fig. 4

Illustration of shifting the multiple time series

Since D-CLINDE, GlobalMIT* and GlobalMIT+ can handle multiple time series, inferring the initial GRN and re-learning the GRN after estimation of hidden common cause(s) pose no difficulty. For estimating the hidden common cause(s) using EM, for each time series we shift according to the estimated delays, and instead of taking only the overlapping parts, we "expand" each time series and concatenate them, as illustrated in Fig. 3.

Results and discussion

In this section, we assess the effectiveness of the proposed algorithm. Since both long time series expression data of large GRNs and knowledge of the true GRN are lacking, we mainly use synthetic data for evaluation. Moreover, to our knowledge, there is no previous work that infers hidden common cause(s) for HO-DBN, so we only compare our algorithm on incomplete data with D-CLINDE and GlobalMIT* on incomplete and complete data. We have generated three types of synthetic data for evaluation: case I: small GRN with one hidden variable, where the bias is known; case II: small GRN without hidden variable, where the bias is known; case III: large GRN (50 and 100 observed genes) with more than one hidden node, where the bias is unknown. For each case, we generate two types of data: one long time series, of which we take prefixes of different lengths; and multiple short time series, of which we use different numbers for different total numbers of time points. For cases I and II, since the networks are small, we use GlobalMIT* and D-CLINDE for inferring the initial GRN and re-learning the final GRN; but for case III, since the networks are large and the number of time points required for decent performance is also large, we use only D-CLINDE to avoid long running times. In all three cases, our proposed algorithm is not given the number of hidden variables. The parameters for generating the synthetic data are summarized in Table 1, and we describe the three cases in the following sections.
Table 1

Parameter settings of synthetic data generation

Parameter                          Case I, II               Case III
Parents (p)                        0, 1, 2, 3               -
Children (c)                       2, 3, 4, 5               -
Observed genes (n)                 p+c                      50, 100
Hidden nodes (n_h)                 1 for case I,            5 for n=50,
                                   0 for case II            10 for n=100
p_bias                             0.65, 0.75, 0.85         0.65, 0.75, 0.85
Number of states                   3                        3
Maximum delay (d)                  4                        4
EM iterations                      100                      1000
Replicates                         20                       40
Time points (T)                    100, 200, 400, 800       100, 200, 400, 800, 1000, 1200, 1400, 1600
Number of short time series (K)    4, 8, 16, 32             4, 8, 16, 32, 40, 48, 56, 64
p_bias known?                      Yes                      No
We also attempt an evaluation on real data, but as mentioned, due to the lack of long real time series expression data, it is infeasible to test our algorithm on large GRNs, so we can only demonstrate our algorithm on small GRNs; even so, the available expression data is insufficient, so this cannot be regarded as a thorough evaluation. For this purpose, we use expression data from [51], which measures the expression of over 6,000 genes of Saccharomyces cerevisiae using DNA microarrays, with three different methods of synchronization for studying the yeast cell cycle. Together with previous data from [52] (also included in [51]), there are 4 time series, with information shown in Table 2. We use YEASTRACT [53] for the GRN. YEASTRACT is a curated database of over 200,000 transcription regulatory associations in Saccharomyces cerevisiae. Since this GRN is far too large for the available expression data, we instead extract a small number of small subnetworks for the demonstration.
Table 2

Information of the real data time series

Series   Raw time points (min)                              Interpolated time points (min)
alpha    every 7 min from 0 to 119                          every 10 min from 0 to 120
cdc15    10, 30, 50, 70, 80, 90, 100, 110, 120, 130,        every 10 min from 10 to 290
         140, 150, 160, 170, 180, 190, 200, 210, 220,
         230, 240, 250, 270, 290
cdc28    every 10 min from 0 to 160                         same time points
elu      every 30 min from 0 to 390                         every 10 min from 0 to 390
In the following, we first describe the performance metrics, then the generation of synthetic expression data given a GRN, then the generation of the synthetic GRNs for the different cases and the results on the three types of synthetic data. After that, we describe the preprocessing of the YEASTRACT subnetworks and the expression data, and present the results of our algorithm on the real data.

Performance metrics

We assess the performance of the inference algorithm on Links (a link is correct if and only if both the gene pair and the direction are correct) and Delays (a delay is correct if and only if both the link and the time delay τ are correct). For each aspect, we mainly look at the F-score as an overall measure of performance, given by F-score = 2 × Precision × Recall / (Precision + Recall), where Recall = TP/(TP+FN), Precision = TP/(TP+FP), and TP, FP and FN are the numbers of true positives, false positives and false negatives respectively. From our experience, the Links and Delays are usually inferred correctly at the same time, rather than getting one correct but missing the other. This is quite reasonable, as a wrong delay may result in totally different associations, so the link is then unlikely to be correct. We therefore focus on Delays, as correct Delays imply correct Links. We still need to address the issue of comparing a predicted GRN with hidden variables against the true GRN, because while the hidden variables in the true GRN are labeled, the indices of the predicted hidden variable(s) may not match those in the true GRN. We therefore map the predicted hidden variables to the true GRN before calculating the performance using the above metrics. In addition, note that for the links to/from a hidden variable, the delays cannot be completely determined. This is illustrated in Fig. 5, where the delays of links out of a hidden variable can be increased/decreased and be compensated by the same decrease/increase in the links into the hidden variable. Therefore, we may need to try different delay shifts when mapping a predicted hidden variable to a true hidden variable, for a useful calculation of the performance.
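The metric can be computed directly from edge sets; the following sketch (with hypothetical names) treats a Delays edge as a (source, target, delay) triple, so a correct link with a wrong delay counts as both a false positive and a false negative.

```python
def f_score(true_edges, predicted_edges):
    """F-score over edges, where each edge is a hashable tuple.

    For Links an edge is (src, dst); for Delays it is (src, dst, delay).
    """
    true_set, pred_set = set(true_edges), set(predicted_edges)
    tp = len(true_set & pred_set)      # true positives
    fp = len(pred_set - true_set)      # false positives
    fn = len(true_set - pred_set)      # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two true delayed edges plus one spurious edge gives precision = recall = 0.5 and hence an F-score of 0.5.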
Fig. 5

Illustration of shifting the delays for hidden variable

We try to align each predicted hidden variable to each of the true hidden nodes, and choose the one with the most matched links (to/from observed genes only) and delays (after shifting). In case of ties, we arbitrarily choose the true hidden variable with the lowest index. After the mapping of the predicted hidden variable(s), the performance of the predicted GRN is calculated as described above.
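The alignment step can be sketched as follows; this is an illustrative reimplementation with hypothetical names, where a hidden node's links are (gene, direction, delay) triples and a shift s is added to outgoing delays and subtracted from incoming ones (leaving the observable gene-to-gene delays unchanged).

```python
def best_match(pred_edges_by_hidden, true_edges_by_hidden, max_shift=4):
    """Map each predicted hidden variable to the true hidden node (and the
    delay shift) with the most matched (gene, direction, delay) edges.

    direction is 'in' (gene -> hidden) or 'out' (hidden -> gene).
    Ties are broken by the lowest true-node index.
    """
    mapping = {}
    for p, p_edges in pred_edges_by_hidden.items():
        best = (-1, None, 0)            # (matched edges, true node, shift)
        for t_idx in sorted(true_edges_by_hidden):
            t_edges = set(true_edges_by_hidden[t_idx])
            for s in range(-max_shift, max_shift + 1):
                shifted = {(g, d, delay + s if d == 'out' else delay - s)
                           for g, d, delay in p_edges}
                n = len(shifted & t_edges)
                if n > best[0]:         # strict >, so ties keep lowest index
                    best = (n, t_idx, s)
        mapping[p] = best[1:]
    return mapping
```

For instance, a predicted hidden node whose outgoing delay is one too large and incoming delay one too small matches perfectly at shift s = −1.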

Generation of synthetic expression data

Given a HO-DBN (the transition network), we generate expression data by using a uniform independent distribution for the prior network to generate the first d (the maximum delay) time points (not included in the final expression data), then generating a time series of the required length using the conditional distributions of the transition network. For generating multiple short time series, the length of each series is chosen uniformly from 20 to 35.
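A minimal sketch of this generation procedure (hypothetical names; the conditional distributions are supplied as a lookup table per gene):

```python
import numpy as np

def simulate_hodbn(parents, cpds, T, d, n_states=3, seed=0):
    """Simulate one discrete time series of length T from a HO-DBN.

    parents[g] is a list of (parent_gene, delay) pairs, and cpds[g] maps a
    tuple of parent states to a categorical distribution over the n_states
    states of g.  The first d time points are drawn uniformly (the "prior
    network") and dropped from the returned series.
    """
    rng = np.random.default_rng(seed)
    n = len(parents)
    x = np.zeros((T + d, n), dtype=int)
    x[:d] = rng.integers(0, n_states, size=(d, n))   # burn-in from the prior
    for t in range(d, T + d):
        for g in range(n):
            config = tuple(x[t - delay, p] for p, delay in parents[g])
            x[t, g] = rng.choice(n_states, p=cpds[g][config])
    return x[d:]                                     # drop the first d points
```

For a deterministic delayed copy (gene 1 repeating gene 0 with delay 1), the generated series reproduces the delayed dependence exactly, which is a convenient sanity check.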

Case I: synthetic small GRN with one hidden node

We first test our proposed algorithm on small GRNs where there is only one hidden node, and the bias p_bias is assumed known.

Network generation

The GRN in this case is illustrated in Fig. 6, where there is one hidden variable, which has p ≥ 0 parents and c ≥ 2 children; the algorithm is not given the number of hidden variables. For each link, the delay is chosen uniformly from {1,…,d}, where d=4. Each variable has 3 states (including the hidden variable), and the inference algorithm uses the maximum number of states of the children as the estimate of the number of states of any hidden common cause, so the predicted hidden variables also have 3 states. For each configuration of the parent(s), one state is randomly chosen as the dominant state in the conditional distribution and receives a probability of p_bias, and the remaining states share the probability of 1−p_bias equally.
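The construction of one such conditional distribution can be sketched as (hypothetical helper name):

```python
import numpy as np

def biased_cpd(n_states, p_bias, seed=0):
    """One conditional distribution: a randomly chosen dominant state gets
    probability p_bias; the remaining states share 1 - p_bias equally."""
    rng = np.random.default_rng(seed)
    dominant = rng.integers(n_states)
    dist = np.full(n_states, (1 - p_bias) / (n_states - 1))
    dist[dominant] = p_bias
    return dist
```

A larger p_bias makes the dependence closer to deterministic, which is why performance in the experiments varies with p_bias.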
Fig. 6

Illustration of the small synthetic network for case I. The hidden variable has p≥0 parents and c≥2 children

The different values of the parameters we have tested are shown in the column Case I, II of Table 1. For each setting of p, c and p_bias, we generate 20 replicates, for a total of 960 GRNs. For the one long time series case, for each replicate, we generate expression data with 800 time points, take prefixes to get T time points, and output only the expression of the observed genes. For the multiple short time series case, for each replicate, we generate 32 short time series and test using K of them at a time.

Results

Table 3 shows the median Delays F-score of our proposed algorithm on case I with D-CLINDE and GlobalMIT* (for initial GRN and re-learning of the final GRN) using one long time series of different lengths, and Table 4 shows the results for using different number of short time series, where the medians are taken over the 20 replicates in each setting.
Table 3

Median delays F-scores of case I using long time series with D-CLINDE and GlobalMIT*

               p_bias=0.65             p_bias=0.75             p_bias=0.85
p  c  T    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*
021000.0000.5000.5000.2000.0000.000
2001.0000.9000.4500.6670.0000.000
4000.9001.0000.0000.0000.0000.000
8001.0001.0001.0000.9000.0000.000
31000.4000.4000.4000.4000.3670.400
2000.8000.8000.7330.8000.4170.733
4000.8000.8000.8000.8000.6670.667
8000.8290.8000.8000.8000.8000.733
41000.3100.5000.5710.6670.6190.667
2000.5860.6670.7080.6670.7500.804
4000.6670.7080.7500.8570.6750.708
8000.8890.8570.8570.8570.7080.829
51000.4440.5000.6670.7080.6670.667
2000.6330.6330.6060.6670.6670.750
4000.8000.7390.7640.8000.6970.667
8000.8000.8890.8170.8890.7850.739
121000.3670.4000.4000.4000.3330.400
2000.0000.0000.6670.7330.3330.400
4000.5000.8000.7330.8000.1430.000
8000.7330.8000.7330.8000.6190.733
31000.4170.5710.5710.6190.5710.619
2000.5000.5710.7500.8570.7080.536
4000.8040.8570.8730.8730.7500.857
8000.8570.8570.8890.8890.5080.606
41000.2860.4720.4720.5710.8890.889
2000.6000.6670.6670.7080.8611.000
4000.8550.9440.6670.7080.8440.889
8000.9090.9090.8000.8170.8000.889
51000.4000.4000.6060.7210.6330.721
2000.6330.6000.7690.8000.6920.785
4000.7690.8330.9090.9090.6150.748
8000.8010.8010.9160.9620.6970.909
221000.2680.3100.3330.5710.0000.000
2000.3670.3670.6610.8570.2680.310
4000.5360.6670.8570.8890.5710.393
8000.6190.6670.8890.8890.1110.125
31000.2860.2860.4950.5860.4720.667
2000.5000.5000.6670.7500.5580.606
4000.4220.5000.7080.7750.7180.750
8000.7270.7750.8000.8890.8000.889
41000.3480.5000.5000.6670.4720.667
2000.4500.6000.6080.6970.7640.800
4000.5860.8000.7690.8710.7080.855
8000.7640.8170.8010.9090.7270.909
51000.4130.4620.6150.6900.6650.769
2000.5000.6200.8080.8620.6900.845
4000.7420.8330.8570.8780.8570.923
8000.8380.9620.8660.9230.8120.857
321000.2360.2680.2500.5000.2680.393
2000.2220.2500.5000.5710.2860.619
4000.2680.2860.6970.7500.6000.708
8000.4440.5000.8000.8610.6000.739
31000.2220.2220.4000.4720.3640.573
2000.3070.4220.4720.5730.6970.800
4000.3640.5000.6970.7270.7270.909
8000.6000.7270.7270.8000.8010.909
41000.3330.3480.3480.4220.5000.697
2000.3820.4730.5520.6670.8130.833
4000.4450.4720.7600.8950.8570.899
8000.6670.7850.8290.8900.8290.866
51000.3880.3580.4290.4810.7140.769
2000.4140.6150.6250.7140.8120.857
4000.6940.6560.7060.7690.8750.933
8000.7080.8660.7890.8570.9040.933
Table 4

Median delays F-scores of case I using multiple short time series with D-CLINDE and GlobalMIT*

               p_bias=0.65             p_bias=0.75             p_bias=0.85
p  c  K    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*
0240.0000.0000.0000.2500.0000.000
80.0000.0000.8330.8330.0000.000
160.2500.8331.0001.0000.2500.250
321.0001.0000.9001.0000.6500.650
340.4500.4000.6670.7330.8000.800
80.7330.8000.6670.7330.6670.800
160.6670.8000.8000.8000.6670.667
320.8290.8000.8290.8000.6670.800
440.5360.6190.6190.7620.7620.857
80.6670.7500.7500.8570.7500.750
160.7500.8570.7500.8040.7080.857
320.8570.9290.8570.8570.7500.804
540.5580.5860.4950.5860.6670.750
80.5500.6190.6970.8000.6670.750
160.8000.7640.8890.8990.6330.739
320.8890.8890.8890.8890.6970.750
1240.4000.4000.4860.5330.6190.800
80.4520.5330.6670.8000.6190.733
160.6670.8000.6670.7330.7330.800
320.7750.8000.6670.7330.6670.800
340.3100.3670.5710.5710.5360.667
80.5710.6670.8040.8570.7500.857
160.8570.8570.8730.8730.8040.857
320.8570.8570.8290.8730.8290.873
440.4720.5000.5000.5710.8000.889
80.5000.5360.6670.7750.8991.000
160.6970.8000.7390.7750.8611.000
320.8170.8990.8000.8000.8990.955
540.5450.7270.6670.8230.5500.573
80.6410.6670.7180.9090.7800.855
160.8610.9090.8450.9090.8330.909
320.8990.9090.8570.9160.7690.883
2240.2860.3330.5000.5710.3100.367
80.4000.4000.6330.7500.1250.200
160.5710.6670.7500.8730.2500.333
320.5860.6670.8290.9440.5710.667
340.2860.3930.4950.5710.4720.536
80.3890.4170.6970.8890.5730.750
160.5000.5710.8000.8000.8000.889
320.7080.7500.7750.7750.7640.861
440.3820.4440.5450.6330.6000.523
80.5000.7640.7270.8170.5800.697
160.7480.7850.8330.9090.8010.871
320.8330.8130.9090.9620.8330.909
540.3330.4310.5710.7180.6780.688
80.5450.6080.7750.8450.7690.812
160.6670.7480.8000.9230.8280.801
320.9330.9230.8570.9230.8570.923
3240.2220.2500.4000.3890.2500.365
80.0000.0000.4220.4680.4720.675
160.3100.2680.6330.7500.6670.819
320.4720.5000.7270.8000.6670.750
340.2220.3430.4000.5000.6000.600
80.4220.5000.4720.6000.6060.697
160.5000.5500.6410.7640.7480.855
320.5730.6670.7690.8170.7850.909
440.3080.4140.4450.4640.7140.727
80.4290.5110.6900.7270.8130.801
160.5230.5860.7600.9230.7850.890
320.6900.7390.8000.9230.8570.899
540.3540.3880.5170.6670.6670.667
80.5170.6150.6070.7460.7890.857
160.5330.7690.7870.8570.8810.933
320.7780.8570.7750.8660.8330.933
First of all, we see that even for these relatively small networks, the number of time points required for decent performance is quite large. This may be because the algorithm does not assume that the number of hidden common causes is known. Besides, the dependency in a HO-DBN can be combinatorial (different configurations of the parents have different conditional distributions for a node), which may also be a reason that a large sample is needed. For large T or K, our proposed algorithm performs adequately (with either D-CLINDE or GlobalMIT*), except for c=2, where the performance is more erratic (e.g. p=2, c=2 and p_bias=0.85) and may be poor even when T=800. One possible reason is that when c=2, there is less information for estimating the hidden common cause. Also, the performance for p=3 is worse than the corresponding results for p=2. One possible reason is that with more parents, it is more difficult to identify all the potential parents of a hidden common cause after the first round of EM, because only pairwise association is used in the identification. Moreover, even if the potential parents have been correctly identified, the estimation of the hidden common cause in the second round is difficult, because there are more configurations of the parents, and consequently more conditional distributions for the hidden common cause, and therefore fewer samples in each cell of the contingency table. Comparing D-CLINDE and GlobalMIT* within our proposed algorithm, the difference in performance is small when T or K is large, but D-CLINDE is usually slightly worse, which is quite reasonable because D-CLINDE is a simple heuristic.
In short, the results show that our proposed algorithm can adequately recover the hidden common cause in small GRNs, given a large enough number of time points.

Case II: synthetic small GRN without hidden node

We also test on small GRNs without any hidden variables, where the algorithm is not given the number of hidden variables, but the bias p_bias is known. The parameters are the same as in case I, shown in the column Case I, II of Table 1. For each GRN (p, c, p_bias and replicate) in case I, we use GlobalMIT* alone on the (incomplete) data of 800 time points to infer a GRN, which is definitely wrong as all true links are to/from the hidden variable. If the inferred GRN is non-empty, it is used; otherwise, a small GRN with a node in the middle having p parents and c−1 children is generated as in case I, but with all genes labeled as observed. Having obtained the 960 GRNs without hidden nodes, the time series are generated as in case I. Table 5 shows the median Delays F-score of our proposed algorithm on case II with D-CLINDE and GlobalMIT* (for the initial GRN and re-learning of the final GRN) using one long time series, and Table 6 shows the corresponding results using multiple short time series.
Table 5

Median delays F-scores of case II using long time series with D-CLINDE and GlobalMIT*

               p_bias=0.65             p_bias=0.75             p_bias=0.85
p  c  T    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*
021001.0001.0001.0001.0001.0001.000
2001.0001.0001.0001.0001.0001.000
4001.0001.0001.0001.0001.0001.000
8001.0001.0001.0001.0001.0001.000
31000.5830.4000.9000.6670.8570.667
2000.8001.0000.8001.0001.0001.000
4001.0001.0001.0001.0001.0001.000
8001.0001.0001.0001.0001.0001.000
41000.5000.4500.5710.5710.6410.667
2000.7330.8890.7080.8890.7180.906
4000.8731.0000.8731.0000.9161.000
8000.9441.0000.8891.0000.8571.000
51000.4500.3670.3330.5080.5230.404
2000.6670.8330.7500.8570.6970.800
4000.8441.0000.8831.0000.8570.962
8000.8891.0000.9161.0000.8571.000
121000.6670.5830.6970.9000.7620.833
2000.9291.0001.0001.0001.0001.000
4001.0001.0001.0001.0000.7331.000
8001.0001.0001.0001.0000.7331.000
31000.4170.4720.5710.6860.7330.775
2000.6670.9620.8000.8440.9441.000
4000.8731.0000.9441.0001.0001.000
8000.8891.0001.0001.0001.0001.000
41000.3100.2790.5000.4320.7330.762
2000.6190.8730.6670.8890.8451.000
4000.8001.0000.8991.0000.8991.000
8000.9711.0000.8991.0000.9231.000
51000.3330.3210.3690.4140.4000.422
2000.4370.5520.5880.8000.6190.750
4000.7430.9230.8751.0000.7530.944
8000.9160.9780.9231.0000.9471.000
221000.5000.5000.4720.6190.6670.733
2000.6670.9290.8001.0000.7331.000
4000.9551.0000.8001.0000.7851.000
8000.8891.0000.8891.0000.8291.000
31000.4220.4500.5000.4860.5860.667
2000.6670.8290.8000.8890.6670.857
4000.8001.0000.8291.0000.9161.000
8000.7751.0000.8991.0000.9551.000
41000.2540.2220.4450.5860.4530.573
2000.6410.6670.7600.9090.7100.933
4000.8661.0000.8891.0000.8821.000
8000.8991.0000.9161.0000.8891.000
51000.2500.2250.4620.4460.5280.473
2000.4000.5000.6330.8570.7030.769
4000.6530.7050.8991.0000.8570.950
8000.8060.9800.9520.9800.9281.000
321000.2500.1110.5000.5000.5580.857
2000.5360.8290.5710.8900.8610.944
4000.6670.8830.8001.0000.8291.000
8000.8041.0000.8731.0000.8001.000
31000.4000.3670.5000.4040.8170.690
2000.6900.8570.6670.9060.6970.829
4000.8001.0000.7751.0000.8570.944
8000.8661.0000.9061.0000.8990.928
41000.3100.2500.4640.6020.4440.591
2000.3250.4110.6670.8900.6760.873
4000.6410.8660.8661.0000.8840.978
8000.7500.9370.9161.0000.8940.952
51000.2380.2930.3670.4080.5020.529
2000.4450.5170.5490.8210.5520.732
4000.6460.8230.8140.9580.7810.923
8000.8240.9470.9160.9850.8820.974
Table 6

Median delays F-scores of case II using multiple short time series with D-CLINDE and GlobalMIT*

               p_bias=0.65             p_bias=0.75             p_bias=0.85
p  c  K    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*    D-CLINDE  GlobalMIT*
0241.0001.0001.0001.0001.0001.000
81.0001.0001.0001.0001.0001.000
161.0001.0001.0001.0001.0001.000
321.0001.0001.0001.0001.0001.000
340.5830.5000.5000.6670.9290.667
81.0001.0000.9001.0001.0001.000
160.8001.0001.0001.0001.0001.000
320.9001.0001.0001.0001.0001.000
440.3670.4000.8000.7750.6670.667
80.8000.8890.8891.0000.8000.929
160.9441.0000.8891.0000.8731.000
321.0001.0000.8891.0000.8571.000
540.4000.3750.5450.6000.6330.633
80.7500.8730.7500.8890.7850.909
160.8380.8990.8260.9060.8570.916
320.8891.0000.9161.0000.8660.967
1240.6670.6670.8001.0000.8001.000
81.0001.0000.8331.0001.0001.000
160.9291.0001.0001.0001.0001.000
321.0001.0001.0001.0001.0001.000
340.4000.3330.8570.8440.8290.873
80.6331.0001.0001.0000.8991.000
160.8891.0001.0001.0001.0001.000
320.8571.0001.0001.0000.9161.000
440.3670.2920.5000.7500.6190.829
80.8000.8000.7750.7750.7690.967
160.8291.0000.9291.0000.8571.000
320.9411.0000.9231.0000.9161.000
540.4250.4140.4450.3780.5580.627
80.5330.8170.6250.8010.6900.861
160.7030.9400.8120.9330.8171.000
320.9320.9440.8991.0000.9411.000
2240.6190.6670.5830.5000.6671.000
81.0001.0000.7080.9290.7331.000
160.9291.0000.8291.0000.8291.000
321.0001.0001.0001.0000.8291.000
340.4440.2860.5360.7330.6670.733
80.7080.7640.8000.8730.8731.000
160.8570.9290.8451.0001.0001.000
320.8570.9670.8731.0001.0001.000
440.2970.4720.4370.6460.5170.708
80.5230.6670.7331.0000.8000.921
160.8000.9160.9061.0000.8291.000
320.9060.9670.9711.0000.8731.000
540.2860.3330.4690.5410.4850.536
80.4550.6000.6830.7600.6860.778
160.6860.8800.8940.9520.8400.935
320.7770.9350.9521.0000.9281.000
3240.4170.3330.5230.6330.5830.708
80.5360.8040.6671.0000.8731.000
160.7500.9060.8821.0000.8731.000
320.7501.0000.7641.0000.7751.000
340.3640.3100.4720.7850.5000.697
80.6330.9290.7500.9280.7270.857
160.7391.0000.8571.0000.8440.971
320.8171.0000.8401.0000.8571.000
440.1740.1910.4500.6670.5280.528
80.4140.7500.7390.9060.7010.912
160.5580.8570.9091.0000.7810.950
320.8400.9160.9541.0000.9090.976
540.2160.1950.4010.4000.5050.574
80.4900.5780.6490.6890.7320.819
160.6670.8850.7910.9160.8210.947
320.8190.9430.8920.9690.8750.974
The performance of our algorithm using either D-CLINDE or GlobalMIT* is good when T≥400 or K≥16, and sometimes even with T=200 or K=8. Also, in many settings the F-score using GlobalMIT* reaches 1, while D-CLINDE only sometimes reaches 1. Similar to case I, using D-CLINDE is slightly worse than using GlobalMIT*. The results show that with an adequate number of time points, our proposed algorithm can infer the GRN correctly when there are no hidden common causes, and does not introduce hidden common causes needlessly.

Case III: synthetic large GRN with more than one hidden node

Besides the above two cases for small GRNs, we also test the more realistic case of larger GRNs with more than one hidden node (the number being unknown to the algorithm), and where the bias p_bias is unknown. For a network with n observed genes, we generate n_h hidden variables (see Table 1). For n observed genes, n_h hidden nodes, a maximum of M_0 parents for observed genes, and a maximum delay of d, we generate a GRN with the structure shown in Fig. 7, where there are four types of nodes: hidden, parents of hidden, children of hidden, and other. The hidden nodes have a random number of distinct parents and children. Parents of hidden take (either 1 or 2 of) other as parents; other take (either 1 or 2 of) any observed genes as parents. After generating the links, the delays and conditional distributions are generated as in cases I and II.
Fig. 7

Illustration of the large synthetic network for case III. Each hidden variable has up to 3 parents, and up to 5 distinct children. The parents of hidden variables can only have other genes as parents, while the other genes can have any observed gene as parents

The parameters that we have tested are listed in the column Case III of Table 1. For each setting of n and p_bias, 40 replicates are randomly generated, for a total of 240 GRNs. For the expression data, we generate up to 1600 time points for the long time series case, and up to 64 time series for the multiple short time series case, to assess the number of time points needed for decent performance for networks of size 50 and 100. Tables 7 and 8 show the median Delays F-score on case III using one long time series and multiple short time series respectively, where complete is D-CLINDE alone on the complete data, the unrealistic case where the expression of all n + n_h nodes is given; hidden is our proposed algorithm using D-CLINDE on the incomplete data, the more realistic case where the expression of the n_h hidden nodes is not given; and ignoreHidden is D-CLINDE alone on the incomplete data, which does not infer hidden common causes. The medians are taken over the 40 replicates in each setting. We also show the ratio of hidden over complete as a percentage. We have also performed one-sided Wilcoxon signed rank tests on whether the median F-score of hidden is better than that of ignoreHidden, and show the p-values that are smaller than 0.1.
Table 7

Median delays F-scores of case III using long time series with D-CLINDE

n  n_h  p_bias  T    Complete (C)  Hidden (H)  IgnoreHidden  H/C  p-value
5050.651000.5260.2590.33949.2 %
2000.7230.4350.51060.2 %
4000.8410.5900.61170.1 %
8000.8980.7570.64784.3 %6.37E-12
10000.9060.7770.66085.8 %9.09E-13
12000.9110.8060.66288.5 %9.09E-13
14000.9160.8220.66089.7 %9.09E-13
16000.9230.8390.66090.9 %9.09E-13
0.751000.6690.3560.45553.1 %
2000.8120.4880.57960.1 %
4000.8640.6760.63178.3 %2.20E-05
8000.9050.7820.64386.5 %9.09E-13
10000.9110.8180.63689.9 %9.09E-13
12000.9100.8280.62991.0 %1.85E-08
14000.9130.8310.63591.1 %9.09E-13
16000.9170.8280.63490.3 %9.09E-13
0.851000.7250.4220.52058.2 %
2000.8230.5540.59767.4 %
4000.8840.7020.63479.5 %1.74E-08
8000.9110.8030.64188.1 %9.09E-13
10000.9150.7960.63887.0 %9.09E-13
12000.9150.8250.62990.2 %9.09E-13
14000.9170.8130.62988.7 %9.09E-13
16000.9120.8180.62589.7 %9.09E-13
100100.651000.4940.2900.31058.6 %
2000.7080.3870.49454.6 %
4000.8240.5710.60069.3 %
8000.8830.7150.63981.0 %2.73E-12
10000.8960.7580.64284.5 %9.09E-13
12000.9000.7900.64987.8 %9.09E-13
14000.9110.7960.65287.4 %9.09E-13
16000.9130.8010.64687.8 %9.09E-13
0.751000.6470.3620.44256.0 %
2000.7950.5150.57764.8 %
4000.8640.6920.63080.1 %8.22E-10
8000.9000.7840.63387.1 %9.09E-13
10000.9090.7940.63487.4 %9.09E-13
12000.9160.8050.63887.9 %9.09E-13
14000.9170.8170.63589.2 %9.09E-13
16000.9210.8270.63289.9 %9.09E-13
0.851000.7050.4190.50559.4 %
2000.8130.5420.58266.7 %
4000.8830.6900.62478.1 %1.82E-12
8000.9120.7660.62784.1 %9.09E-13
10000.9170.7840.62285.5 %9.09E-13
12000.9190.7780.62284.7 %9.09E-13
14000.9200.7980.61886.7 %9.09E-13
16000.9260.7900.61685.3 %9.09E-13

Complete is D-CLINDE on the complete data. hidden is our proposed algorithm with D-CLINDE on the incomplete data. ignoreHidden is D-CLINDE on the incomplete data. p-value is for one-sided Wilcoxon signed rank test on whether the median F-score of hidden is better than ignoreHidden, and entries larger than 0.1 are omitted. H/C is the ratio of hidden over complete as percentage.

Table 8

Median Delays F-scores of Case III using Multiple Short Time Series with D-CLINDE

n  n_h  p_bias  K    Complete (C)  Hidden (H)  IgnoreHidden  H/C  p-value
5050.6540.5700.2820.37849.5 %
80.7450.4260.53357.2 %
160.8380.6050.60972.1 %
320.8980.7840.65787.3 %9.09E-13
400.9050.8130.65789.8 %9.09E-13
480.9050.8280.65991.4 %9.09E-13
560.9160.8310.65590.7 %9.09E-13
640.9180.8280.65790.2 %9.09E-13
0.7540.6920.3630.48652.4 %
80.8060.5190.59964.4 %
160.8710.7080.63881.2 %1.82E-12
320.9120.7860.64086.2 %9.09E-13
400.9170.8260.64190.1 %9.09E-13
480.9190.8340.63690.8 %9.09E-13
560.9200.8530.63692.7 %9.09E-13
640.9180.8470.62692.3 %9.09E-13
0.8540.7400.4290.53557.9 %
80.8290.5950.61171.8 %
160.8870.7280.63882.0 %8.00E-11
320.9150.8160.63789.2 %1.82E-12
400.9240.8340.63490.2 %9.09E-13
480.9240.8210.63488.9 %9.09E-13
560.9250.8390.62990.7 %9.09E-13
640.9220.8500.63192.3 %9.09E-13
100100.6540.5280.2820.33553.5 %
80.7250.4170.50957.6 %
160.8220.5770.59470.2 %
320.8870.7360.64483.0 %9.09E-13
400.8940.7590.64884.9 %9.09E-13
480.9070.7770.65285.7 %9.09E-13
560.9110.8000.65487.7 %9.09E-13
640.9150.8130.65388.9 %9.09E-13
0.7540.6760.3720.46155.0 %
80.8070.5250.57865.0 %
160.8730.6800.63077.9 %6.93E-10
320.9100.7840.64886.2 %9.09E-13
400.9150.8130.64288.9 %9.09E-13
480.9170.8280.64390.3 %9.09E-13
560.9220.8220.64289.2 %9.09E-13
640.9230.8360.63590.6 %9.09E-13
0.8540.7310.4280.51758.5 %
80.8290.5640.59168.0 %
160.8840.7140.62780.8 %9.09E-13
320.9110.7720.62884.7 %9.09E-13
400.9190.7870.62285.6 %9.09E-13
480.9210.7930.62286.1 %9.09E-13
560.9220.7860.62285.3 %9.09E-13
640.9270.7920.62085.4 %9.09E-13

Complete is D-CLINDE on the complete data. hidden is our proposed algorithm with D-CLINDE on the incomplete data. ignoreHidden is D-CLINDE on the incomplete data. p-value is for one-sided Wilcoxon signed rank test on whether the median F-score of hidden is better than ignoreHidden, and entries larger than 0.1 are omitted. H/C is the ratio of hidden over complete as percentage.

First of all, note that complete can achieve good performance when T or K is large, even though D-CLINDE is only a heuristic. When T≥800 or K≥32, the performance of complete is better than hidden, which in turn is better than ignoreHidden, as expected. Also, hidden can achieve more than 80 % of the performance of complete. But since having complete data is quite unrealistic in a real world setting, the main comparison of interest is between hidden and ignoreHidden, i.e. between handling and not handling hidden common causes. We see that hidden is significantly better (with low p-values) than ignoreHidden once T≥800 or K≥32. These results show that our proposed algorithm can recover hidden common causes for larger GRNs, where the number of hidden common causes and the bias of the conditional distributions are unknown.

Random candidate order in clustering

By default, when clustering the candidates, they are considered sequentially from smaller index to larger index. As the clustering is currently a simple greedy algorithm, this raises the question of whether the order affects the inferred networks. We have therefore added the option of using a random order, and for the GRNs in case III, for each setting of n, p_bias, and T (for one long time series) or K (for multiple short time series), we arbitrarily choose replicate 1 out of the 40 replicates and repeat the inference using 100 random clustering orders. Tables 9 and 10 show the mean and standard deviation of the Links and Delays F-scores using one long time series and multiple short time series, respectively. From the results, we see that the F-scores for different clustering orders are very similar, and the standard deviations are all less than 0.06. This suggests that the clustering order does not have a great effect on the quality of the resulting networks.
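A sketch of greedy clustering with an optional random order follows; the compatibility predicate is a hypothetical stand-in for the actual clustering criterion, which the source does not spell out here.

```python
import random

def greedy_cluster(candidates, compatible, seed=None):
    """Greedy clustering of candidate genes.

    Candidates are scanned in index order by default; each one joins the
    first existing cluster all of whose members it is compatible with
    (compatible is a pairwise predicate), otherwise it starts a new
    cluster.  Passing a seed shuffles the scan order, which is how the
    order-sensitivity experiment can be run.
    """
    order = sorted(candidates)
    if seed is not None:
        random.Random(seed).shuffle(order)
    clusters = []
    for c in order:
        for cl in clusters:
            if all(compatible(c, m) for m in cl):
                cl.append(c)
                break
        else:
            clusters.append([c])
    return clusters
```

With a transitive compatibility predicate the resulting partition is order-independent; order effects can only arise when compatibility is not transitive, which is what the random-order experiment probes.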
Table 9

Mean and standard deviation of links and delays F-scores of case III using long time series with the proposed algorithm with D-CLINDE

n    n_h  bias  T     LF mean  LF s.d.  DF mean  DF s.d.
50   5    0.65  100   0.279    0.034    0.279    0.033
                200   0.447    0.031    0.439    0.033
                400   0.597    0.026    0.595    0.025
                800   0.749    0.025    0.742    0.025
                1000  0.739    0.018    0.739    0.018
                1200  0.746    0.021    0.744    0.022
                1400  0.750    0.019    0.749    0.018
                1600  0.744    0.018    0.744    0.018
          0.75  100   0.344    0.035    0.343    0.035
                200   0.493    0.040    0.483    0.039
                400   0.734    0.047    0.732    0.047
                800   0.889    0.030    0.877    0.030
                1000  0.898    0.023    0.893    0.024
                1200  0.919    0.023    0.919    0.023
                1400  0.914    0.015    0.914    0.015
                1600  0.901    0.021    0.896    0.021
          0.85  100   0.462    0.046    0.461    0.046
                200   0.470    0.046    0.469    0.046
                400   0.755    0.053    0.755    0.053
                800   0.807    0.035    0.807    0.035
                1000  0.875    0.043    0.875    0.043
                1200  0.865    0.050    0.865    0.050
                1400  0.891    0.036    0.891    0.036
                1600  0.890    0.038    0.890    0.038
100  10   0.65  100   0.316    0.027    0.312    0.027
                200   0.400    0.025    0.398    0.025
                400   0.575    0.022    0.573    0.022
                800   0.751    0.023    0.749    0.023
                1000  0.729    0.018    0.727    0.018
                1200  0.827    0.022    0.826    0.022
                1400  0.839    0.014    0.839    0.014
                1600  0.825    0.019    0.820    0.019
          0.75  100   0.444    0.026    0.441    0.026
                200   0.569    0.027    0.567    0.028
                400   0.758    0.021    0.757    0.021
                800   0.759    0.023    0.756    0.023
                1000  0.769    0.027    0.768    0.027
                1200  0.791    0.035    0.791    0.035
                1400  0.829    0.029    0.829    0.029
                1600  0.819    0.029    0.817    0.029
          0.85  100   0.444    0.025    0.443    0.025
                200   0.503    0.031    0.502    0.030
                400   0.675    0.028    0.675    0.029
                800   0.787    0.019    0.787    0.019
                1000  0.774    0.022    0.773    0.023
                1200  0.774    0.027    0.774    0.028
                1400  0.784    0.024    0.784    0.024
                1600  0.789    0.022    0.788    0.022

The results are on the incomplete data, using replicate 1 for each setting of n, p and T, with 100 random orders in clustering the candidates. LF is the Links F-score, and DF is the Delays F-score.

Table 10

Mean and standard deviation of links and delays F-scores of case III using multiple short time series with the proposed algorithm with D-CLINDE

n    n_h  bias  K    LF mean  LF s.d.  DF mean  DF s.d.
50   5    0.65  4    0.373    0.033    0.368    0.031
                8    0.426    0.032    0.421    0.031
                16   0.630    0.028    0.628    0.027
                32   0.740    0.019    0.734    0.018
                40   0.771    0.023    0.763    0.023
                48   0.763    0.027    0.760    0.026
                56   0.785    0.023    0.782    0.023
                64   0.802    0.027    0.789    0.028
          0.75  4    0.382    0.036    0.374    0.035
                8    0.689    0.029    0.683    0.029
                16   0.752    0.031    0.749    0.031
                32   0.869    0.032    0.869    0.033
                40   0.923    0.033    0.923    0.033
                48   0.898    0.032    0.898    0.032
                56   0.919    0.022    0.919    0.022
                64   0.887    0.023    0.887    0.023
          0.85  4    0.352    0.048    0.351    0.048
                8    0.499    0.051    0.498    0.050
                16   0.673    0.056    0.672    0.057
                32   0.808    0.049    0.807    0.048
                40   0.850    0.042    0.849    0.042
                48   0.832    0.041    0.831    0.040
                56   0.870    0.035    0.867    0.035
                64   0.890    0.025    0.890    0.025
100  10   0.65  4    0.312    0.029    0.309    0.029
                8    0.448    0.025    0.444    0.025
                16   0.604    0.027    0.599    0.028
                32   0.747    0.029    0.738    0.029
                40   0.789    0.025    0.784    0.025
                48   0.811    0.022    0.806    0.021
                56   0.801    0.026    0.795    0.026
                64   0.844    0.024    0.840    0.025
          0.75  4    0.365    0.022    0.362    0.022
                8    0.552    0.025    0.551    0.025
                16   0.678    0.023    0.673    0.023
                32   0.813    0.030    0.808    0.029
                40   0.848    0.022    0.848    0.022
                48   0.848    0.023    0.848    0.023
                56   0.862    0.019    0.861    0.019
                64   0.849    0.021    0.849    0.021
          0.85  4    0.462    0.031    0.460    0.031
                8    0.584    0.025    0.584    0.025
                16   0.708    0.029    0.708    0.029
                32   0.769    0.023    0.769    0.023
                40   0.833    0.031    0.829    0.031
                48   0.805    0.036    0.801    0.036
                56   0.817    0.030    0.813    0.030
                64   0.818    0.031    0.814    0.031

The results are on the incomplete data, using replicate 1 for each setting of n, p and K, with 100 random orders in clustering the candidates. LF is the Links F-score, and DF is the Delays F-score.


Different number of iterations in EM

The time series of the hidden common causes are estimated using Expectation Maximization (EM) with random initial parameters and restarts, but EM may be sensitive to the initialization. In this subsection we repeat the experiment of case III using different numbers of EM iterations, namely 100, 200, 500, 1000, 2000 and 5000, to assess their effect. Tables 11 and 12 show the median Delays F-score of our proposed algorithm with D-CLINDE on case III with incomplete data, using one long time series and multiple short time series respectively, as the number of EM iterations is varied. From the results, we can see that the median F-scores are very similar across the different numbers of EM iterations, suggesting that EM has effectively converged. In addition, as mentioned in the previous subsections, our algorithm on incomplete data (hidden) has decent performance, which suggests that EM has converged to a reasonably good (local) solution.
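The restart-and-iterate structure of EM can be illustrated with a toy discrete model. The two-coin mixture below is only a stand-in: the paper's EM estimates hidden time series of an HO-DBN, and its exact update equations are not reproduced here. Equal mixture weights of 0.5 are assumed for simplicity:

```python
import math, random

def em_two_coins(flips, iters=200, restarts=5, seed=0):
    """EM for a mixture of two biased coins (toy discrete model),
    with random restarts scored by log-likelihood, mirroring the
    restart/iteration machinery (not the paper's actual EM).
    `flips` is a list of (heads, n) per observed sequence."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        pA, pB = rng.random(), rng.random()  # random initialization
        for _ in range(iters):
            # E-step: responsibility of coin A (mixture weights fixed at 0.5)
            sA = sB = nA = nB = 0.0
            for heads, n in flips:
                lA = pA**heads * (1 - pA)**(n - heads)
                lB = pB**heads * (1 - pB)**(n - heads)
                r = lA / (lA + lB)
                sA += r * heads;       nA += r * n
                sB += (1 - r) * heads; nB += (1 - r) * n
            # M-step: re-estimate each coin's bias
            pA, pB = sA / nA, sB / nB
        # keep the restart with the highest log-likelihood
        ll = sum(math.log(0.5 * pA**h * (1 - pA)**(n - h)
                          + 0.5 * pB**h * (1 - pB)**(n - h))
                 for h, n in flips)
        if best is None or ll > best[0]:
            best = (ll, min(pA, pB), max(pA, pB))
    return best[1], best[2]

# sequences drawn from coins with biases near 0.17 and 0.83
lo_p, hi_p = em_two_coins([(8, 10), (9, 10), (2, 10), (1, 10), (8, 10), (2, 10)])
```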
Table 11

Median delays F-scores of case III using long time series with the proposed algorithm with D-CLINDE

n    n_h  bias  T     em100  em200  em500  em1000  em2000  em5000
50   5    0.65  100   0.259  0.252  0.262  0.259   0.269   0.265
                200   0.430  0.420  0.431  0.435   0.426   0.431
                400   0.583  0.579  0.590  0.590   0.585   0.585
                800   0.753  0.758  0.752  0.757   0.759   0.750
                1000  0.774  0.787  0.781  0.777   0.772   0.780
                1200  0.801  0.791  0.799  0.806   0.811   0.805
                1400  0.815  0.824  0.823  0.822   0.825   0.822
                1600  0.831  0.843  0.837  0.839   0.833   0.835
          0.75  100   0.361  0.362  0.360  0.356   0.354   0.356
                200   0.477  0.484  0.485  0.488   0.486   0.494
                400   0.681  0.683  0.673  0.676   0.681   0.681
                800   0.789  0.803  0.787  0.782   0.795   0.785
                1000  0.809  0.818  0.820  0.818   0.818   0.821
                1200  0.821  0.816  0.830  0.828   0.820   0.831
                1400  0.827  0.835  0.830  0.831   0.834   0.832
                1600  0.824  0.830  0.835  0.828   0.829   0.828
          0.85  100   0.412  0.422  0.424  0.422   0.419   0.417
                200   0.569  0.565  0.555  0.554   0.555   0.573
                400   0.704  0.706  0.712  0.702   0.709   0.702
                800   0.806  0.805  0.803  0.803   0.801   0.807
                1000  0.794  0.795  0.798  0.796   0.789   0.795
                1200  0.818  0.820  0.819  0.825   0.822   0.820
                1400  0.824  0.822  0.822  0.813   0.819   0.822
                1600  0.821  0.827  0.813  0.818   0.826   0.821
100  10   0.65  100   0.291  0.277  0.282  0.290   0.283   0.285
                200   0.398  0.395  0.400  0.387   0.390   0.396
                400   0.566  0.575  0.576  0.571   0.571   0.574
                800   0.722  0.715  0.716  0.715   0.724   0.728
                1000  0.751  0.763  0.763  0.758   0.764   0.757
                1200  0.783  0.787  0.787  0.790   0.782   0.784
                1400  0.797  0.798  0.799  0.796   0.803   0.800
                1600  0.792  0.802  0.792  0.801   0.797   0.794
          0.75  100   0.360  0.370  0.358  0.362   0.363   0.356
                200   0.506  0.504  0.516  0.515   0.508   0.514
                400   0.688  0.690  0.689  0.692   0.689   0.700
                800   0.780  0.777  0.783  0.784   0.783   0.775
                1000  0.802  0.792  0.799  0.794   0.805   0.806
                1200  0.811  0.815  0.812  0.805   0.814   0.813
                1400  0.818  0.824  0.814  0.817   0.820   0.816
                1600  0.832  0.825  0.832  0.827   0.829   0.828
          0.85  100   0.412  0.426  0.424  0.419   0.415   0.408
                200   0.544  0.540  0.545  0.542   0.540   0.538
                400   0.695  0.689  0.690  0.690   0.691   0.692
                800   0.771  0.768  0.772  0.766   0.772   0.767
                1000  0.780  0.779  0.787  0.784   0.784   0.781
                1200  0.776  0.769  0.776  0.778   0.776   0.785
                1400  0.793  0.798  0.795  0.798   0.793   0.800
                1600  0.791  0.789  0.795  0.790   0.793   0.796

The results are on the incomplete data, with different numbers of EM iterations. em100 is using 100 EM iterations, em200 is using 200 EM iterations, and so on.

Table 12

Median delays F-scores of case III using multiple short time series with the proposed algorithm with D-CLINDE

n    n_h  bias  K    em100  em200  em500  em1000  em2000  em5000
50   5    0.65  4    0.308  0.290  0.288  0.282   0.295   0.291
                8    0.433  0.442  0.432  0.426   0.421   0.433
                16   0.603  0.614  0.609  0.605   0.608   0.604
                32   0.784  0.780  0.785  0.784   0.792   0.779
                40   0.809  0.819  0.818  0.813   0.818   0.814
                48   0.828  0.830  0.831  0.828   0.829   0.836
                56   0.833  0.838  0.829  0.831   0.831   0.834
                64   0.840  0.833  0.834  0.828   0.837   0.837
          0.75  4    0.362  0.374  0.363  0.363   0.365   0.372
                8    0.523  0.520  0.513  0.519   0.519   0.523
                16   0.706  0.703  0.708  0.708   0.699   0.704
                32   0.790  0.785  0.796  0.786   0.793   0.790
                40   0.821  0.833  0.827  0.826   0.828   0.824
                48   0.838  0.834  0.835  0.834   0.827   0.836
                56   0.852  0.855  0.851  0.853   0.852   0.850
                64   0.851  0.852  0.855  0.847   0.851   0.856
          0.85  4    0.444  0.431  0.425  0.429   0.424   0.423
                8    0.591  0.578  0.582  0.595   0.599   0.599
                16   0.722  0.726  0.728  0.728   0.735   0.734
                32   0.801  0.810  0.806  0.816   0.812   0.810
                40   0.825  0.830  0.827  0.834   0.821   0.832
                48   0.829  0.825  0.827  0.821   0.826   0.828
                56   0.836  0.837  0.833  0.839   0.838   0.841
                64   0.848  0.844  0.844  0.850   0.842   0.846
100  10   0.65  4    0.281  0.284  0.277  0.282   0.285   0.280
                8    0.424  0.426  0.420  0.417   0.424   0.426
                16   0.567  0.574  0.571  0.577   0.575   0.578
                32   0.731  0.732  0.730  0.736   0.739   0.736
                40   0.755  0.762  0.767  0.759   0.757   0.763
                48   0.770  0.776  0.773  0.777   0.779   0.770
                56   0.797  0.797  0.805  0.800   0.801   0.794
                64   0.812  0.815  0.813  0.813   0.814   0.809
          0.75  4    0.371  0.374  0.374  0.372   0.375   0.368
                8    0.523  0.525  0.519  0.525   0.528   0.525
                16   0.682  0.685  0.680  0.680   0.684   0.681
                32   0.795  0.795  0.792  0.784   0.784   0.788
                40   0.815  0.816  0.815  0.813   0.820   0.820
                48   0.829  0.823  0.830  0.828   0.829   0.826
                56   0.823  0.819  0.826  0.822   0.821   0.831
                64   0.838  0.837  0.838  0.836   0.836   0.838
          0.85  4    0.431  0.430  0.431  0.428   0.432   0.435
                8    0.566  0.566  0.568  0.564   0.557   0.565
                16   0.701  0.709  0.706  0.714   0.705   0.708
                32   0.778  0.782  0.776  0.772   0.781   0.789
                40   0.780  0.789  0.788  0.787   0.787   0.787
                48   0.790  0.794  0.797  0.793   0.796   0.794
                56   0.777  0.785  0.786  0.786   0.788   0.789
                64   0.789  0.795  0.790  0.792   0.788   0.792

The results are on the incomplete data, with different numbers of EM iterations. em100 is using 100 EM iterations, em200 is using 200 EM iterations, and so on.


Small YEASTRACT subnetworks with real data

Preprocessing of subnetworks

We accessed YEASTRACT [53] (http://www.yeastract.com/formfindregulators.php) to obtain the regulating TFs of a list of 149 TFs, using the “DNA binding and expression evidence” option. This gives 392 links involving only 129 TFs, and we used the “ORF List ⇔ Gene List” utility of YEASTRACT to convert the gene names into ORF IDs; all 129 IDs appear in the yeast cell cycle data [51]. Since the GRN is still too large for the limited data, we have chosen 22 subnetworks, with sizes and constituent TFs shown in Table 13. In each subnetwork, a TF that has children in the subnetwork is chosen to be the hidden variable. Since the delays of the links are not known, we focus on the Links performance for this demonstration.
Table 13

YEASTRACT Subnetworks

sn    n   n_L  Hidden TF  Other TFs
sn1   4   5    MBP1       ASH1, HCM1, SWI4
sn2   5   11   GLN3       DAL80, GAT1, GCN4, UGA3
sn3   6   5    ADR1       IME1, MSN4, PIP2, STE12, USV1
sn4   6   5    ASH1       ACE2, MBP1, SWI5, TOS8, YHP1
sn5   6   6    YAP6       CBF1, CIN5, MOT3, PDR1, TUP1
sn6   6   10   MSN2       ADR1, FHL1, NRG1, SOK2, USV1
sn7   6   12   DAL80      GAT1, GLN3, STE12, SUM1, TEC1
sn8   7   6    ACE2       ASH1, FKH2, GAT1, HMS2, INO4, SFL1
sn9   7   7    MET4       ABF1, HAP4, MET28, MET32, SFP1, TYE7
sn10  7   7    MSN4       ADR1, HAL9, RAP1, ROX1, RPN4, SOK2
sn11  7   7    UME6       GAT1, GSM1, LEU3, MSN2, OAF1, SIP4
sn12  7   8    STE12      MIG2, MSN2, PDR1, PDR3, SOK2, YAP1
sn13  7   9    CIN5       FLO8, IXR1, NRG1, XBP1, YAP1, YAP6
sn14  7   11   MCM1       MET32, STE12, SWI4, SWI5, TYE7, YAP3
sn15  7   11   RAP1       FKH1, FKH2, MCM1, SFP1, STE12, SWI5
sn16  7   14   FLO8       CIN5, HCM1, HMS1, STE12, TEC1, TOS8
sn17  9   12   PDR1       HAP4, MET28, PDR3, RPN4, SFL1, SWI4, YAP5, YAP6
sn18  9   16   RPN4       HSF1, MSN2, MSN4, PDR1, PDR3, PUT3, REB1, YAP1
sn19  10  17   STE12      CBF1, HAP4, MET4, MSN2, PDR1, RAP1, ROX1, SOK2, YAP1
sn20  11  13   ABF1       DAL81, ECM22, HAP1, HMS2, MET4, MGA1, REB1, RTG3, STP1, SUM1
sn21  12  23   STE12      ASH1, FLO8, OAF1, RAP1, RFX1, SFP1, SKO1, SOK2, TEC1, TOS8, XBP1
sn22  13  38   ROX1       FHL1, HAP1, HAP4, HMS1, IXR1, MSN2, MSN4, SKN7, SKO1, STE12, XBP1, YAP1

sn is the subnetwork. n is the number of TFs in the subnetwork, and n_L is the number of links in the subnetwork. The hidden TF is the one whose expression is hidden in the incomplete setting of the experiments.


Preprocessing of expression data

The yeast cell cycle data [51] (http://genome-www.stanford.edu/cellcycle/) contains 4 time series: alpha, cdc15, cdc28 and elu, with different lengths and time points, as shown in the second column of Table 2. We perform spline interpolation (using the spline() function in R) onto the time points shown in the third column of Table 2 to make the time points equidistant. Some TFs are entirely missing in some series, and we fill these in with zero; for other missing values, we rely on the spline interpolation. Since we are learning a discrete HO-DBN, we perform quantile discretization of the expression data into 3 states, and prepare two datasets for each subnetwork and each time series: complete, which contains the expression of all TFs of the subnetwork; and incomplete, which omits the expression of the chosen hidden variable. Therefore, there are 8 expression datasets for each subnetwork. Since the subnetworks are not large and time is not a major concern, we use our proposed algorithm with both D-CLINDE and GlobalMIT+. We test using each one of alpha, cdc15, cdc28 and elu, and also using all 4 series together. The number of EM iterations is 1000, the maximum delay is 4, and the bias p is treated as unknown. For D-CLINDE, we have tried the score thresholds 1, 1.3, 2, 2.3 and 3; for GlobalMIT+, we have tried the α values 0.9, 0.95, 0.99, 0.995 and 0.999. Table 14 shows the best Links F-score over the series used and parameters tested, where complete is D-CLINDE or GlobalMIT+ alone on the complete data, hidden is our proposed algorithm on incomplete data, and ignoreHidden is D-CLINDE or GlobalMIT+ alone on the incomplete data.
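The two preprocessing steps, interpolation onto an equidistant grid and 3-state quantile discretization, can be sketched as follows. This dependency-free Python sketch uses linear interpolation in place of R's cubic spline(), and the helper names are our own:

```python
import bisect

def interp_to_grid(times, values, grid):
    """Interpolate (times, values) onto the given grid points.
    The paper uses cubic spline interpolation (R's spline());
    linear interpolation keeps this sketch dependency-free."""
    out = []
    for t in grid:
        j = bisect.bisect_left(times, t)
        if j == 0:
            out.append(values[0])        # before first point: hold value
        elif j == len(times):
            out.append(values[-1])       # after last point: hold value
        else:
            t0, t1 = times[j - 1], times[j]
            w = (t - t0) / (t1 - t0)
            out.append((1 - w) * values[j - 1] + w * values[j])
    return out

def quantile_discretize(values, n_states=3):
    """Map each value to a state 0..n_states-1 so that the states
    are roughly equally populated (quantile discretization)."""
    srt = sorted(values)
    cuts = [srt[(len(srt) * k) // n_states] for k in range(1, n_states)]
    return [sum(v >= c for c in cuts) for v in values]

grid_vals = interp_to_grid([0, 2, 4], [0.0, 4.0, 8.0], [1, 3])
states = quantile_discretize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```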
Table 14

Best links F-scores of YEASTRACT subnetworks using our proposed algorithm with D-CLINDE and GlobalMIT+

               D-CLINDE                          GlobalMIT+
sn    n   n_L  Complete  Hidden  IgnoreHidden   Complete  Hidden  IgnoreHidden
sn1   4   5    0.600     0.571   0.571          0.286     0.500   0.267
sn2   5   11   0.533     0.429   0.429          0.453     0.659   0.421
sn3   6   5    0.333     0.571   0.000          0.400     0.364   0.000
sn4   6   5    0.364     0.308   0.000          0.267     0.400   0.000
sn5   6   6    0.400     0.500   0.000          0.250     0.364   0.000
sn6   6   10   0.414     0.387   0.400          0.343     0.480   0.286
sn7   6   12   0.429     0.476   0.430          0.353     0.316   0.267
sn8   7   6    0.571     0.667   0.000          0.444     0.381   0.000
sn9   7   7    0.267     0.588   0.000          0.545     0.444   0.222
sn10  7   7    0.364     0.286   0.000          0.462     0.400   0.000
sn11  7   7    0.250     0.364   0.000          0.286     0.286   0.000
sn12  7   8    0.462     0.667   0.500          0.286     0.308   0.333
sn13  7   9    0.381     0.677   0.133          0.364     0.636   0.000
sn14  7   11   0.250     0.594   0.267          0.250     0.500   0.250
sn15  7   11   0.361     0.411   0.361          0.250     0.316   0.200
sn16  7   14   0.320     0.333   0.222          0.258     0.308   0.207
sn17  9   12   0.222     0.444   0.125          0.325     0.522   0.154
sn18  9   16   0.293     0.404   0.190          0.299     0.333   0.167
sn19  10  17   0.174     0.286   0.182          0.195     0.289   0.195
sn20  11  13   0.214     0.778   0.148          0.200     0.568   0.105
sn21  12  23   0.216     0.250   0.108          0.205     0.321   0.212
sn22  13  38   0.226     0.210   0.195          0.180     0.252   0.183

Complete is D-CLINDE or GlobalMIT+ on the complete data. hidden is our proposed algorithm on the incomplete data (without the hidden node). ignoreHidden is D-CLINDE or GlobalMIT+ on the incomplete data.

When using D-CLINDE, hidden is better than ignoreHidden in 19 out of 22 subnetworks, ties in 2, and is worse in 1. When using GlobalMIT+, hidden is better than ignoreHidden in 21 out of 22 subnetworks and worse in 1. This shows that our proposed algorithm helps to infer a more accurate GRN from limited data, because it considers the possibility of a hidden common cause. Also note that hidden is sometimes even better than complete, which is counter-intuitive. This suggests that the given data is insufficient for robust GRN inference even when complete. Another possible reason is that our algorithm assumes that the children of a predicted hidden common cause are not linked to each other, which may help the GRN inference in the limited-data case. If more time points were available, we would expect the situation to resemble the synthetic data case, where complete performs slightly better than hidden. As mentioned before, since the real data is very limited, we cannot draw strong conclusions for the YEASTRACT subnetworks, but the results suggest that our proposed algorithm has potential in helping to recover hidden common causes in real GRNs, though more data is needed.
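For reference, the Links F-score used throughout compares the set of predicted links against the set of true links; a minimal sketch follows (the Delays F-score additionally requires the estimated delay of a link to match, which this sketch omits):

```python
def links_f_score(predicted, true_links):
    """Links F-score: harmonic mean of precision and recall over
    regulatory links, where each link is a (regulator, target) pair."""
    pred, true = set(predicted), set(true_links)
    tp = len(pred & true)                # correctly predicted links
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)

# hypothetical example: 2 of 3 predictions correct, 2 of 4 true links found
f = links_f_score([("a", "b"), ("a", "c"), ("b", "c")],
                  [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")])
```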

Conclusions

In this paper, we have developed an algorithm to infer from expression data the transition network of a discrete HO-DBN which may have a small but unknown number of hidden common causes, with some assumptions on the conditional distributions in the HO-DBN. We have tested our algorithm on 3 types of synthetic data: small GRN with one hidden node, small GRN with no hidden node, and large GRN with a small but unknown number of hidden nodes. Experiment results show that our proposed algorithm can recover the causal GRN adequately given the incomplete data. Using the limited real expression data and small YEASTRACT subnetworks, we have also demonstrated the potential of our algorithm to recover hidden common causes in real data, but more time series expression data is needed.

Future works

For future work, we intend to develop more sophisticated clustering of candidate genes with hidden common cause(s), instead of the simple greedy heuristic. In addition, we intend to investigate different methods of specifying the similarity threshold S0. Currently the similarity of two time series is measured by the maximum −log10(p-value) of G2 tests over shifts of the two time series; the threshold S0 is therefore already related to a p-value, which helps in setting an appropriate value. To set S0 more formally, one way would be to obtain the distribution of the similarity score when two time series are unrelated; a threshold could then be set such that the probability of incorrectly putting two unrelated time series into the same cluster is controlled. However, obtaining the theoretical distribution may not be straightforward. Alternatively, the empirical distribution of the similarity score could be used: for example, first randomly choose two time series, then randomly permute the time points of one of them, and calculate the similarity score. Repeating this a large number of times gives an empirical distribution of the similarity score, from which an appropriate threshold could be set. Also, we intend to study how to relax the assumptions on the conditional distributions. Another issue worth pursuing is to better decide the number of states of hidden common causes, for example by incorporating the techniques in [49]. Also, estimating the bias of the conditional distributions is an important part of the proposed algorithm, as we rely on it to predict the genes with hidden common causes. In this study, we use the maximum probability as the estimate of the bias of each conditional distribution (conditional on one configuration of the parent(s)), and use the median of these estimates as the estimate of the bias of a gene.
If a gene has many configurations of parents, some cells of the contingency table may not have enough data points for proper estimation of the conditional distribution, and consequently the estimation of the bias may be affected. We use the median as a simple strategy to alleviate this problem, in the hope that more than half of the conditional distributions of a gene have a properly estimated bias. One possible alternative is a Bayesian model, where there is an overall unknown parameter p with a prior distribution; given this parameter, for each gene and each configuration of the gene's parent(s), the conditional distribution has a (possibly different) dominant state with probability p, while the other states share the remaining 1−p equally; and these conditional distributions produce the observed time series expression data. Under this model, we may attempt to estimate the posterior distribution of p given the observed time series expression, which may better handle the issue of insufficient data points in some cells of the contingency table. We intend to study this in more depth as future work. In this study, we learn an initial GRN, estimate the hidden common cause(s), and re-learn the GRN to give the final GRN if any hidden common cause is learnt (steps 1 to 4 in Fig. 2). It would be interesting to see whether iterating steps 2 to 4 would allow more hidden nodes to be estimated. If the hidden common causes of the observed genes are estimated sufficiently well after the first iteration, the hidden common cause(s) (if any) of these hidden common causes might be estimated in the next iteration. However, this may be difficult, because we do not expect the true hidden time series of the common causes to be recovered exactly from the estimation.
More realistically, the estimated time series may have discrepancies from the true time series, though they may still allow the causal relationships of the hidden variable with the other observed variables to be recovered adequately. Consequently, further estimating the hidden common causes of estimated hidden common causes would be harder still. This is an interesting direction to investigate as future work.
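The empirical-distribution idea for setting S0 described above can be sketched as follows. Any similarity function where larger means more similar is assumed (the paper uses the maximum −log10(p-value) of G2 tests; the toy similarity and the function name here are illustrative):

```python
import random

def empirical_threshold(series_list, similarity, alpha=0.05,
                        n_perm=1000, seed=1):
    """Estimate a similarity threshold S0 as the (1 - alpha) quantile
    of similarity scores between unrelated series: repeatedly pick two
    series, permute the time points of one to break any relation, and
    score the pair. Illustrative sketch of the idea in the text."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_perm):
        a, b = rng.sample(series_list, 2)  # pick two distinct series
        b = b[:]
        rng.shuffle(b)                     # permute one series
        null_scores.append(similarity(a, b))
    null_scores.sort()
    return null_scores[int((1 - alpha) * n_perm) - 1]

# assumed data: 10 random discrete (3-state) series of length 50
gen = random.Random(0)
data = [[gen.randrange(3) for _ in range(50)] for _ in range(10)]
# toy similarity: fraction of time points where the states agree
sim = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)
s0 = empirical_threshold(data, sim)
```

With this threshold, the chance of placing two genuinely unrelated series in the same cluster is held near alpha.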

Availability of supporting data

The data set(s) supporting the results of this article is(are) included within the article (and its additional file(s)).
References (27 in total; first 10 shown)

1. Butte AJ, Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput. 2000.

2. Lewis J. Autoinhibition with transcriptional delay: a simple mechanism for the zebrafish somitogenesis oscillator. Curr Biol. 2003.

3. Lo LY, Leung KS, Lee KH. Inferring Time-Delayed Causal Gene Network Using Time-Series Expression Data. IEEE/ACM Trans Comput Biol Bioinform. 2015.

4. Swinburne IA, Silver PA. Intron delays and transcriptional timing during development. Dev Cell. 2008.

5. Pavlov MY, Ehrenberg M. Rate of translation of natural mRNAs in an optimized in vitro system. Arch Biochem Biophys. 1996.

6. Maucher M, Kracher B, Kühl M, Kestler HA. Inferring Boolean network structure via correlation. Bioinformatics. 2011.

7. Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics. 2010.

8. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell. 1998.

9. Vinh NX, Chetty M, Coppel R, Wangikar PP. Gene regulatory network modeling via global optimization of high-order dynamic Bayesian network. BMC Bioinformatics. 2012.

10. Teixeira MC, Monteiro PT, Guerreiro JF, Gonçalves JP, Mira NP, dos Santos SC, Cabrito TR, Palma M, Costa C, Francisco AP, Madeira SC, Oliveira AL, Freitas AT, Sá-Correia I. The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res. 2013.