Literature DB >> 26394325

Time Delayed Causal Gene Regulatory Network Inference with Hidden Common Causes.

Leung-Yau Lo1, Man-Leung Wong2, Kin-Hong Lee1, Kwong-Sak Leung1.   

Abstract

Inferring the gene regulatory network (GRN) is crucial to understanding the working of the cell. Many computational methods attempt to infer the GRN from time series expression data, instead of through expensive and time-consuming experiments. However, existing methods make the convenient but unrealistic assumption of causal sufficiency, i.e. all the relevant factors in the causal network have been observed and there are no unobserved common cause. In principle, in the real world, it is impossible to be certain that all relevant factors or common causes have been observed, because some factors may not have been conceived of, and therefore are impossible to measure. In view of this, we have developed a novel algorithm named HCC-CLINDE to infer an GRN from time series data allowing the presence of hidden common cause(s). We assume there is a sparse causal graph (possibly with cycles) of interest, where the variables are continuous and each causal link has a delay (possibly more than one time step). A small but unknown number of variables are not observed. Each unobserved variable has only observed variables as children and parents, with at least two children, and the children are not linked to each other. Since it is difficult to obtain very long time series, our algorithm is also capable of utilizing multiple short time series, which is more realistic. To our knowledge, our algorithm is far less restrictive than previous works. We have performed extensive experiments using synthetic data on GRNs of size up to 100, with up to 10 hidden nodes. The results show that our algorithm can adequately recover the true causal GRN and is robust to slight deviation from Gaussian distribution in the error terms. We have also demonstrated the potential of our algorithm on small YEASTRACT subnetworks using limited real data.

Entities:  

Mesh:

Substances:

Year:  2015        PMID: 26394325      PMCID: PMC4578777          DOI: 10.1371/journal.pone.0138596

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Knowing the gene regulatory network (GRN) in the cell is crucial to understanding the working of the cell. In the cell, some proteins are transcription factors (TFs) which trigger or inhibit the transcription of other gene(s), which are then translated into proteins with a delay. These delays have been known to affect the network stability, or cause oscillations [1-4]. A simplistic view of the GRN is a directed network resulting from the complex causal interactions between genes, where each directed link is labeled with the delay. Since experimentally determining the regulatory targets of each TF is expensive and time-consuming, there have been many computational methods that attempt to utilize high-throughput microarray and RNA-seq gene expression data to infer the GRN. High-throughput technology such as microarray or RNA-seq allows the expression of thousands of genes to be measured at the same time, and allows time series expression data to be obtained when this is done for a number of time points. Even though many GRN inference methods have been developed, to our knowledge, they all implicitly make the assumption of causal sufficiency, i.e. all the relevant factors in the causal network have been observed and there are no unobserved common cause. This assumption is convenient, but very unrealistic. For example, miRNAs were previously not thought to take important roles in gene regulation. In principle, in the real world, it is impossible to be certain that all relevant factors or common causes have been observed, because some factors may not have been conceived of, and therefore are impossible to measure.

Gene Network Inference

There have been many GRN inference algorithms and models, with different levels of details, see [5, 6] for surveys of GRN modelling and [7] for survey on GRN inference algorithms for microarray expression data. Due to the nature of GRN, most models of GRN could be described as a graph, where the vertices are the genes under consideration, and the edges represent the regulatory relationships. Different levels of details could be achieved by labeling the edges with extra information. In the simplest case, an undirected graph could be used, in which case only an association network of the genes is captured. ARACNE [8] infers undirected network using mutual information, but also uses Data Processing Inequality to try to eliminate indirect interactions. C3NET [9] first identify gene pairs with significant mutual information, then link each gene to the neighbor (if any) with highest mutual information, and output an undirected and conservative network, with no other explicit means of eliminating indirect effect. But without direction in the edges, there is no causal interpretation. On the other hand, directed edges could be used, as in [10], which uses genetic algorithm to optimize a score based on partial correlation, estimated regulatory direction and effect, but the output edges are not labeled with time delays. Since the process of gene transcription and mRNA translation both take time, and non-negligible translational time delays have been observed [11, 12]. Moreover, RNA polymerase, a main working protein in transcription, has been observed to pause during transcription, adding a cumulative of 204–307s over a 2.3kb region [13]. It is known that these delays affect the network stability, or cause oscillations [1-4]. Therefore, these delays should also be taken into account for a more accurate GRN model. Some algorithms consider delay of only one time step, as in [14], which considers discretized expression data, and uses association rule mining to find frequent regulatory patterns, but without eliminating indirection association. Boolean network, e.g. in [15], is a classic model of GRN where the expression of each gene is discretized to only “on” or “off”, and the expression of each gene at the next time step is a boolean function of expression of its regulators at the current time step. Another popular class of GRN model is Ordinary Differential Equations (ODE), where the rate of change of expression of each gene is a (linear or nonlinear) function of the expression of the gene and its regulators. When discretized in time, it reduces to one time step model. Examples are [16], which uses Gaussian process for Bayesian inference of an ODE model; and DELDBN [17], which combines ODE model with local Bayesian analysis, and uses estimated Markov blanket as the regulators of each gene. There are also Dynamic Bayesian Network (DBN) based models, which avoids the limitation of plain Bayesian network that no cycles are allowed. An example is [18], which utilizes Bayesian structural expectation maximization to infer a one time step DBN model. There are relatively few algorithms that infer multiple time delays. [19] first estimates the possible delays from pairwise mutual information from discretized expression data, then infer multiple time step DBN by minimizing MDL score using genetic algorithm. Banjo [20] also optimizes a score metric on DBN using discretized expression data by MCMC based method, and the program later allows multiple delay. TD-ARACNE [21] is an extension of ARACNE with time delays. But these algorithms do not label the edges of GRN with regulatory effect. In contrast, in DD-lasso [22], the expression of a gene is a linear combination of expression of its regulators at (possibly different) previous time steps. It first estimates the delays between each gene pairs by maximum likelihood, then uses lasso [23] to remove indirect effects and estimate the coefficients, therefore the edges are labeled with the delays as well as the regulatory effect. CLINDE [24] uses a similar model, but uses conditional independence of the shifted time series to estimate the delays and eliminate indirect effects. Some other algorithms use perturbation data, or use a combination of perturbation data and time series expression. [25] is a parallel implementation of the Network Identification by multiple Regression (NIR) algorithm utilizing perturbation data. [26] needs promoter sequence and TF binding site information in addition to (non-time series) expression data. [27] is an Inductive Causation (IC) [28, 29] based method, which uses steady state data, with partial prior knowledge of ordering of regulatory relationship, and uses entropy to test conditional independence, giving an acyclic network where some edges may remain undirected. [30] uses convex programming on an ODE model using perturbation data. TSNI [31] solves a discretized linear ODE model using time series expression data after each gene is perturbed. [32] uses a Dynamic Nested Effects Model using perturbation data, where the delays are assumed to have exponential distribution. [33] uses knockdown data for Boolean network with exponentially distributed time delays.

Causality Inference

Outside of GRN inference, there have been quite a lot of works in the last two decades on inferring a causal network from observational data. Under the framework, the unknown causal relationships are represented by a directed graph, and the observed correlations and partial correlations are related to d-separations among the variables (through statistical tests), and therefore constrain the possible structures of the graph. In the basic framework, correlations are used to give possible causal links (without direction), and partial correlations are used to remove indirect links, finally the assumption of acyclicity gives rise to orientation rules to give directions to the links. Therefore, sometimes some links would remain undirected, as both directions are possible. These researches have been well summarized by [28] and [29]. Most works focused on causal structures represented as directed acyclic graph (DAG), because time delays are not considered, and the data used are not time series data. [34] attempted to extend the method for acyclic graph to cyclic graph, with more complex orientation rules. More recently, LiNGAM [35] formulated the causality inference as Independent Component Analysis (ICA) for the linear non-Gaussian acyclic structural equation model, and [36] extended it to the cyclic case. [37] extended LiNGAM to add time delays by first fitting an autoregressive model for the time delayed effects, and then fit the residuals with LiNGAM. [38] revisited the concept of Granger Causality. [39] generalized Bayesian network structure learning to cyclic structure.

Hidden Common Cause

Hidden (latent) variables have been an important topic in causality inference. The problem is that when hidden common causes are ignored, the causality inferred could be misleading. This is illustrated in Fig 1, where some nodes are wrongly thought to be causally linked.
Fig 1

Example of Misleading Inference if Hidden Common Cause is Ignored.

The number on the link is the delay, and + or − is the sign of the effect.

Example of Misleading Inference if Hidden Common Cause is Ignored.

The number on the link is the delay, and + or − is the sign of the effect. [40] is an early work that formulates the problem as determining the constraints on the variance-covariance matrix of observed data, then searching for a causal structure that would explain the constraints. Some works assume the presence of hidden common cause of observed variables, but focus only on the relationship of observed variables, with indication that some may have hidden common cause. For example, [41] is a Granger-causality based method that learns a mixed graph from time series data, where directed edges represent direct causal relationship, and dashed edges represent relationship due to hidden common cause. [42] is an extension of the FCI algorithm [29] and also outputs a mixed graph, but does not need time series data. [43] is another extension of the FCI algorithm, and learns an acyclic network with no time delays, and with no consideration that the latent variables may have observed variables as parents. In [44], a stochastic differential equation model (discretized in time) is used, where hidden variables are assumed, but only to more accurately estimate the relationship between observed variables, by a convex optimization based method. In [45], the d-separation and d-connection information, which are in practice provided by conditional tests, are encoded as a Satisfiability (SAT) problem, which is then incrementally solved to attempt to recover the dependency structure between observed variables, with indication that some have latent variables, but some edges may be marked as “unknown” if the given information cannot determine whether it is present or absent. [46] uses nested effects models using perturbation data with no time delays, and hidden common effect of two observed variables may be predicted, and some edges indicate possible presence of hidden nodes. Some works assume that hidden variables may have other hidden variables as parents, but cannot have observed variables as parents. [47] and [48] are works in this direction, where observed variables are linear function of its parents (either hidden or observed), and hidden variables can be nonlinear function of its parents (only hidden variables). In [49], a linear Bayesian network is learnt, but it is assumed that there are no edges among observed variables, and that the hidden variables are linearly independent. On the other hand, some works are more general and allow the hidden variables to have observed variables as parents. [50] uses a dynamic Bayesian network (DBN) model, and is based on the observation that in DBN, ignoring hidden variable usually results in violation of Markov property. Based on this, the algorithm tries to find non-markovian correlations (those across more than one time step) and introduce hidden variable to explain them. However, this work does not evaluate how close the resulting dependency structure is to the assumed causal structure, but only focuses on the likelihood on the testing set. [51] learns a discrete Bayesian network with hidden variable without time delays. It assumes that a hidden variable has observed variables as parents and children. It is motivated with the observation that when the hidden variable is not taken into account, the inferred dependency of the observed variables (the parents and children of the hidden variable) will usually be overly complicated, with many connections. Based on this, they find structural signature (semi-clique) to guess the position of the hidden variable(s) and its local structure, then make adjustment by explicitly linking the variables of the semi-clique with the introduced hidden variable. This local structure may then be fine-tuned by Structural-EM (SEM). This work also focuses on the likelihood in the evaluation, but not the inferred structure. This also places some restrictions on the subnetwork surrounding the hidden variable(s), e.g. a hidden variable must have parent(s), which are observed variable(s), also the total number of parents and children of a hidden variable must be at least four, as the smallest semi-clique has size four. [52] complements [51] and focuses on learning the dimensionality (the number of states) of hidden variables.

Objective

First of all, it is important to emphasize that even inferring the causal relationships of observed variables is highly non-trivial, so inferring causal relationships of hidden variable(s) is obviously much more difficult and it is not possible to recover all possible cases of hidden variables. In some cases, the hidden variable is not very interesting, and in some cases, it would be too difficult to recover. For instance, if a hidden variable has no children, or one child with zero or one parent, it is not very interesting and is difficult to detect and estimate. The case that is feasible to handle is a hidden variable with two or more observed children, with or without parents, so we focus on this case. In this paper, we assume that there is a sparse causal graph (possibly with cycles) of interest, where the variables are continuous and each causal link has a delay (possibly more than one time step). A small but unknown number of variables are not observed. Each unobserved (hidden) variable has only observed variables as children and parents, with at least two children and possibly no parents, and the children of unobserved variable(s) are not linked to each other. Although it is conceivable that two children of a hidden variable may be linked, so that a child has both the other child and the hidden variable as parents, it is difficult to differentiate whether the high correlation between two children are solely due to the hidden common cause, or also due to the presence of direct link. Therefore, we make the simplifying assumption that the children of hidden variable(s) are not linked to each other. Our objective is to infer the causal graph with the delays, given the time series of the observed variables. Since it is difficult to obtain very long time series, it is more realistic that the algorithm be also capable of utilizing multiple short time series, which we call segments in this paper. The segments are not necessarily of the same length (e.g. obtained from replicate experiments). To our knowledge, previous works either make much more restrictive assumptions or give different types of output (e.g. no delays in the links).

Methods

The overall flow of the proposed method is given in Fig 2. The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the variance of the error terms, 3) estimate the hidden common cause(s), 4) infer the parents and children of the hidden common causes. The program code is written in C, any parameters described in this paper has a default value, and can be changed as needed. Also, sample running time is provided to give an idea of the running time needed. The program can be obtained at https://github.com/peter19852001/hcc_clinde. Below we first describe the data and model assumed in this paper, then describe each step in more details, where we first describe the case with one segment, and later we describe the case of multiple segments in a separate subsection.
Fig 2

Overall Flow of the Algorithm.

The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the variance of the error terms, 3) estimate the hidden common cause(s), 4) infer the parents and children of the hidden common causes.

Overall Flow of the Algorithm.

The steps are 1) infer an initial GRN of the observed genes, 2) determine the genes with hidden common cause(s) by the variance of the error terms, 3) estimate the hidden common cause(s), 4) infer the parents and children of the hidden common causes.

Data and Model

The GRN model assumed here is: where x (t) is the expression value of gene i at time t, i = 1,…,n + n , t = 1,…,m, and there are n observed genes, n hidden variable(s) and m equidistant time points. n is unknown, but . a is the regulatory effect of gene i on gene j, where the regulatory effect is repressive if a is negative, activatory if positive, and absent if zero. a = 0 for , i.e. there are no causal links between hidden variables. for , ∣{j : a ≠ 0}∣ ≥ 2, i.e. each hidden variable has at least two observed genes as children. if a gene has a hidden parent, it has no other parents. τ is the positive time delay of the edge i → j (if a ≠ 0). ϵ (t) is the error term for gene j at time t. We assume that E(ϵ (t)) = 0 and Var(ϵ (t)) = σ 2, i.e. the error terms are zero-mean and have fixed variance. They are also assumed to be mutually independent, but otherwise we do not make stringent assumptions on the distribution of the error terms. Note that this model does not preclude self-regulation or cycles in the GRN, though any cycles must have positive delays. The given data is {x (t)}, for i = 1,…,n. If the raw input data does not have equidistant time points, interpolation (e.g. spline interpolation) could be performed as preprocessing before using this algorithm.

Initial GRN

The first step is to obtain an initial GRN. There are not many GRN inference methods that handles multiple time delays, CLINDE [24] and DD-lasso [22] are two of them, and both handles multiple short time series. We choose CLINDE to infer the initial GRN, based on the comparison in [24], which shows that CLINDE outperforms DD-lasso for smaller number of time points relative to the number of genes. Also because CLINDE does not restrict the multiple time series to be of the same length, unlike DD-lasso. CLINDE is based on the PC algorithm [29], and consists of two stages. Stage 1 considers all (directed) pairs of genes x and y, and considers all possible delays d up to a maximum allowed delay, to determine if x → y is significant with the delay d based either on a correlation test, or mutual information test. The test is considered significant if the score of the test is larger than a score threshold. In the correlation test and partial correlation test below, the score is −log10(p-value), and in the mutual information test and conditional mutual information test below, the score is the (conditional) mutual information. So a higher score threshold means a more stringent test. And for correlation test, the regulatory effect (positive or negative) is also estimated if the edge is significant. After stage 1, there may be multiple edges from x to y, but with different delays. Also note that there may be cycles, but any cycle must have positive delays. Stage 2 attempts to prune the edges by partial correlation test or conditional mutual information test. Iteratively, the remaining edges are considered for pruning by conditioning first on h = 1 neighbor, then h = 2 neighbors, and so on up to h = N 0, for a given parameter N 0. In each iteration, each remaining edge is tested by conditioning on h neighbors, shifted properly using the delays estimated in stage 1. If the conditional test is not significant, the edge is pruned. After stage 2, the remaining edges are output as the GRN. Note that the pruning removes only indirect effects, and with sufficient data (so that the conditional tests truely reflects conditional independence relationships) any genuine cycles should remain, because each edge in the cycle is a direct effect. In this paper, we use (partial) correlation tests, which is the default for CLINDE. And after stage 2, the links with zero estimated delays are discarded to get the initial GRN for the following steps, as we assume the delays are positive. Since CLINDE can infer cyclic GRN, so by using CLINDE for the initial GRN, HCC-CLINDE can handle cyclic GRN without further special handling.

Identification and Estimation of Hidden Common Cause

After the initial GRN of the observed gene is obtained, for each gene, we can regress its expression on the shifted expression of its parents and obtain the corresponding error terms. Therefore we can calculate the empirical variance of the error terms. The idea is that for genes with hidden common causes, in the initial GRN, its parents would be determined incorrectly (likely the parent(s) or other children of the hidden variable) because the hidden variable is not observed, and consequently the variance of the error term is likely to differ from the expected fixed variance. On the other hand, if a gene has no hidden variable as parent, its parents would hopefully be determined correctly, and therefore the error term would have the expected variance. The variance of the error terms therefore provides a clue to whether a gene has hidden variable as parent. To illustrate, consider the case depicted in Fig 3, where X and Y are independent and H is hidden: Suppose that in the initial GRN, the parents of Z are X and Y, so that Z = c(aX + bY + e ) + e , and the variance of the error terms would be . Similarly, if the parent of Z was incorrectly determined to be W, whereas we have , and the variance of the error terms would be . In both cases, the variance of the fitted error terms would be larger than expected. On the other hand, in the initial GRN, if too many false parents were predicted for a gene, its error terms may have smaller variance than expected, but this is not common in CLINDE. In this paper, we use a simple thresholding to decide if a gene may have hidden parents: , where Var(e) is the observed variance of the error terms of a gene, ρ ≥ 0 (with a default value of 0.1) is the tolerance, and σ 2 is the expected variance. Those genes determined to have hidden parents are called candidates.
Fig 3

Example of Hidden Node.

X and Y are independent. H is hidden.

Example of Hidden Node.

X and Y are independent. H is hidden. If the number of observed genes n is small, we assume that the expected variance σ 2 is known and given. In contrast, if n is larger, we could instead estimate σ 2 from the observed variances of the error terms of the genes. For each observed gene, we can calculate the variance of the error terms based on the initial GRN. Based on the assumption that there are only a small number of hidden variables relative to the number of observed genes, in this paper we simply use the median of these variances as the estimate of the expected variance. If there are no candidates, we output the initial GRN as the final GRN. Otherwise, we cluster them to estimate the hidden common cause for each cluster, based on the fact that genes with common parent are correlated. From our preliminary tests, we found that even a simple greedy clustering algorithm works adequately: where c is the center of cluster i, C is cluster i. d(x,y) measures the similarity of two time series x and y, here we use the maximum absolute correlation of the shifted time series (shift y relative to x, from −τ 0 to τ 0, where τ 0 is the maximum delay). Let the k candidates be {g 1,g 2,…,g } Set n ← 1, c 1 ← g 1, τ 1 ← 0, C 1 ← {g 1} Set ρ 0 ← (1+ρ)e 2, where e 2 is the expected variance of the error terms, and ρ is the tolerance For i = 2,…,k Let d = argmax1 ≤ d(c ,g ), and set τ be the associated time shift of g relative to c If , update C ← C ∪{g } Otherwise, set n ← n + 1, then set C ← {g }, τ ← 0 Output the n clusters {C : 1 ≤ j ≤ n }, and the time shifts {τ : 1 ≤ i ≤ k} The basic idea is that for each series, find out which cluster is the most similar to it, and if the similarity is high enough, it is included in that cluster; otherwise it becomes a new cluster. The similarity threshold is obtained from simple calculations as follows. Consider Fig 3 for Z and W, where Z = cH + e and W = dH + e , so we have: Since our threshold of the variance for e and e is ρ 0 = (1 + ρ)e 2, for two genes z and w having a hidden common cause, we have the corresponding lower threshold of correlation as After the clustering, if there are no clusters with size larger than one, there is no need to estimate the hidden common causes, and we simply output the initial GRN as the final GRN. Otherwise, we would estimate a hidden common cause for each cluster C where . Suppose C = {g ,g …,g }, we first center the expression of these genes to get . Since each gene g has the associated time shift τ relative to the cluster center, we wish to estimate the coefficients and the hidden common cause h(t) where . Note that the scale of and h(t) is undetermined as for any β ≠ 0. Also, since the time series may not be aligned (see Fig 4), so we first estimate the overlapping part, then estimate the prefix and suffix.
Fig 4

Illustration of Estimation of Hidden Common Cause from Un-aligned Time Series.

Suppose that relative to the cluster center, the overlapping parts are t ≤ t ≤ t , let Y = [y ] for 1 ≤ k ≤ ∣C ∣ and t ≤ t ≤ t and , and let for t ≤ t ≤ t , we want , as illustrated in Fig 4. This is conveniently solved by Singular Value Decomposition of Y as low rank approximation to get where σ is the largest singular value and the Frobenius norm of E is minimized. We then arbitrarily take and . Having estimated the coefficients , we still need to estimate the non-overlapping prefix () and suffix () of h (t), if any. For or , we estimate h (t) by minimizing where the sum is over k for which 1 ≤ t−τ ≤ m. This is readily solved to be . Lastly, we take for 1 ≤ t ≤ m−1 and as the estimate of the hidden common cause of cluster C , i.e. take the suffix of h (t) and shift by 1 so that precedes all the genes in C in time.

Inferring Causal Relationships of Estimated Hidden Common Cause

After estimating the hidden common cause(s), we attempt to re-estimate the subnetwork of the introduced hidden common cause(s), the candidates, and the parents of the candidates. For this, we apply CLINDE, with the restriction that in stage 1 of CLINDE, we do not consider the links in-between the candidates and links into the parents of the candidates, and consequently the resulting subnetwork will only consists of links from parents of candidates to hidden common causes and candidates, and from hidden common causes to candidates. Having obtained the subnetwork, we remove the links into the candidates in the initial GRN, and then union the resulting GRN with the subnetwork to get the final GRN. The rationale is that we expect the candidates to have hidden common causes as the real parents, where their apparent parents in the initial GRN may really be parents of the hidden common cause(s). Also, as the hidden common cause is estimated from the observed children, it is difficult to differentiate whether the high correlation between two children of a hidden variable is solely due to the hidden common cause, or that there is also a direct link between the two. In this paper, we therefore make the simplifying assumption that the children of a hidden variable are not linked to each other.

Handling Multiple Segments of Time Series Data

The above describes the steps of HCC-CLINDE when one segment of time series data is provided, we now consider the case where multiple segments of time series data are provided. We do not assume that the segments have the same lengths, but we assume that the time steps of different segments are the same. The main idea of handling multiple segments is that when the expression of genes have to be shifted by a delay (e.g. for calculating correlation), all the segments are shifted, and the overlapping parts are concatenated for the calculation. This is illustrated in Fig 5.
Fig 5

Illustration of Handling Multiple Segments of Time Series.

For inferring the initial GRN, we use CLINDE, which can handle multiple segments, in the same manner illustrated in Fig 5 for calculating correlation and partial correlation. For identifying the candidate genes with hidden common cause, we need to first regress the expression of a gene on its parents (from the initial GRN), which is done by shifting multiple segments by the estimated delays using the overlapping parts for regression. After that, the clustering poses no difficulty as it only needs the correlation, which can be calculated in the same manner as in the initial GRN. For estimating the hidden common cause(s) using SVD, we first shift and take the overlapping parts of the segments, and use SVD to estimate the center part of the hidden common cause for each segment and estimate the coefficients, then we go through each segment and estimate the prefix and suffix. This is illustrated in Fig 4. Lastly, after the estimation of hidden common cause(s), we infer the causal links for the hidden common cause(s) by CLINDE on subnetwork, which poses no difficulty as CLINDE can handle multiple segments.

Experiment Results

This section evaluates the effectiveness of our proposed algorithm HCC-CLINDE. We mainly rely on synthetic data, where we know the underlying gene network and the hidden variables, and the lack of sufficient expression data is not a concern. Since to our knowledge, there is no previous work that is similar to ours (infer hidden common cause with time delays), we can only compare our algorithm on incomplete data with CLINDE on incomplete and complete data, to show the improvement over ignoring hidden common cause. We have tested on three types of synthetic data: 1) small GRN with one hidden node and the variance of error terms is known; 2) small GRN without hidden node and the variance of error terms is known; 3) large GRN with more than one hidden node and the variance of error terms is unknown. For all three cases, we try the score thresholds (st) 2, 3 and 4 for CLINDE, and use the default value for other CLINDE parameters. And for each case, we generate two types of data: one long segment of time series where we take prefixes of different lengths; and multiple segments where we use different number of segments for different total number of time points. Due to the lack of long time series expression real data, we are unable to test our algorithm on large GRN, but we could demonstrate our algorithm on small GRNs. A dataset with relatively large number of time points is [53], which measures the expression of over 6,000 genes of Saccharomyces cerevisiae using DNA microarrays, with three different methods of synchronization for studying yeast cell cycle. Together with previous data from [54] (also included in [53]), we have 4 time series with information shown in Table 1. For the GRN, we use YEASTRACT [55], which is a curated database of over 200,000 transcription regulatory associations in Saccharomyces cerevisiae (including 113 TFs). Since the GRN is too large for the available expression data, we extract a small number of subnetworks for testing instead.
Table 1

Information of the Time Series Real Data.

Series Raw Time Points (Min) Interpolated Time Points (Min)
alphaevery 7 mins from 0 to 119every 10 mins from 0 to 120
cdc1510, 30, 50, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 210, 220, 230, 240, 250, 270, 290every 10 mins from 10 to 290
cdc28every 10 mins from 0 to 160same time points
eluevery 30 mins from 0 to 390every 10 mins from 0 to 390
In the following, we first describe the performance metrics, and then the generation of synthetic expression data once the GRN is given, and then describe the generation of the synthetic GRN for different cases, and the results of the three types of synthetic data. After that, we describe the preprocessing of the YEASTRACT subnetworks and the expression data, and then present the results of our algorithm on the real data.

Performance Metrics

It is more appropriate to focus on correctly predicting the presence of links rather than its absence, due to the sparse nature of GRNs. We assess the performance of the inference algorithm on three aspects, namely Links (which is considered correct if and only if both the gene pair and the direction are correct), Delays (which is considered correct if and only if both the link and the time delay τ are correct) and Effects (which is considered correct if and only if both the link and the sign of the effect a are correct). For each aspect, we look at three metrics respectively, namely Recall , Precision and F-score , where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives. But since F-score is an overall measurement of performance, we focus on it. We assume that in the true GRN, the hidden nodes are labeled. But in the predicted GRN, the number and the indices of the predicted hidden nodes may not be the same as the true GRN. Therefore, we need to map the predicted hidden nodes to the true GRN before doing the above performance calculations. Also, for the links to/from the hidden common nodes, the delays and the effects cannot be completely determined because there is no observed time series data for the hidden nodes to constrain them. This is illustrated in Fig 6, where the sign of a hidden node can be flipped, and be compensated by a flipping the signs of all its links; and the delays of links out of a hidden node can be shifted, and be compensated by the same shift of links into the hidden node. Therefore, in mapping the predicted hidden nodes to true hidden nodes, we may need to shift the delays, and flip the signs of the links appropriately, for useful calculation of the performance.
Fig 6

Example of Flipping Signs of Links and Shifting of Delays for Hidden Node.

The number on the link is the delay, and + or − is the sign of the effect. Consistently flipping the signs and shifting the delays results in equivalent predicted GRN, as the hidden node is not observed.

Example of Flipping Signs of Links and Shifting of Delays for Hidden Node.

The number on the link is the delay, and + or − is the sign of the effect. Consistently flipping the signs and shifting the delays results in equivalent predicted GRN, as the hidden node is not observed. For each predicted hidden node, we try to align it to each of the true hidden nodes, and choose the one with the most matched links and delays (after shifting). And in case of ties, we arbitrarily choose true hidden node with the lowest index. When aligning a predicted hidden node, we consider only the links to/from observed genes, and consider all possible shifts in delay (for a shift of d, the delays in parent links will be increased by d, while that in children links will be decreased by d), and also whether to flip the signs of the links. How well it is aligned is measured by the sum of number of matched links and shifted delays. After the mapping of predicted hidden nodes (and shifting in delays and flipping of signs), the performance of the predicted GRN is calculated as described above.

Generation of Synthetic Expression Data

One Segment

Given a network as (a ,τ ) where a is the effect of gene i to gene j, and τ is the associated delay if a ≠ 0. Given the parameters σ 2 which is the variance of the error terms, and α which controls the gaussianity of error terms, we then generate the expression data with m time points as follows: For , 1 ≤ j ≤ n, set x (t) = e (t), where each e (t) is generated from N(0,1) For 1 ≤ t ≤ m, 1 ≤ j ≤ n, set , where e (t) = sign(z )s∣z ∣, where each z is generated from N(0,1) and s is used for scaling such that Var(e (t)) = σ 2 Output {x (t)}1 ≤

Multiple Segments

In order to generate K segments of time series data, we first randomly pick the lengths of each segment uniformly between 20 and 30 (inclusive), emulating the situation of multiple short time series, possibly with different lengths. Then we generate each segment of specified length as above.

Synthetic Small GRN with One Hidden Node

We first test our proposed algorithm on small GRN where there is only one hidden node, and the variance of the error terms is assumed known.

Network Generation

In this small GRN case, each GRN has one hidden node, p observed parents and c observed children, as illustrated in Fig 7, but when applying the algorithm, we do not assume the presence of hidden node. For each link, the delay is uniformly chosen from {1,…,τ 0}, where τ 0 = 4; and the coefficient is a = ρ z , where ρ is uniformly chosen from {−1,1} and z is uniformly chosen from (0.5,1.5). Then we permute the indices of the observed genes.
Fig 7

Small Synthetic GRN.

There are p parents, c children and one hidden node.

Small Synthetic GRN.

There are p parents, c children and one hidden node. We test a few values of the parameters, as shown in the column Small GRN of Table 2. For each setting of p, c, σ 2 and α, we generate 20 replicates, for a total of 6,400 GRNs. For the one segment case, for each replicate, we generate expression data with 200 time points, and then take prefix to get m time points, and output only the expression of the observed genes, for a total of 25,600 time series. And for the multiple segments case, for each replicate, we generate 8 segments, for a total of 51,200 segments, and for each setting, we test using K segments at a time.
Table 2

Parameter Settings of Synthetic Data Generation.

Parameter Small GRN Large GRN
Parents (p)0, 1, 2, 3
Children (c)2, 3, 4, 5
Observed genes (n) p + c 50, 100
Hidden nodes (n h)1 for hidden case, 0 for non-hidden case5 for n = 50, 10 for n = 100
σ 2 0.5, 1, 2, 3, 40.5, 1, 2, 3, 4
α 0.5, 1, 2, 30.5, 1, 2, 3
Maximum parents (M 0)4
Maximum delay (τ 0)44
Replicates2040
Time points (m)20, 50, 100, 20020, 50, 100, 200, 400, 800
Number of segments (K)1, 2, 4, 81, 2, 4, 8, 16, 32
σ 2 known?YesNo

Results

First we show that the performances of Links, Delays and Effects are usually very consistent, with occasional discrepancies. Fig 8 shows the profiles of F-scores of Links, Delays and Effects for different settings for small hidden case with one segment time series. The three F-scores are mostly consistent with each other, though occasionally there are greater discrepancies for some records. In order to illustrate that these cases are very small, we plot the histogram of the absolute difference between Links and Effects F-scores in Fig 9. The histogram for difference of F-scores between Delays and Effects and between Links and Effects are similar and are shown in Figures A to B in S1 File. This suggests that the Links, Delays and Effects are usually inferred correctly at the same time, rather than getting one correct but the other incorrect. Therefore, in the following we mainly show the results of Effects, for the lack of space, but the full results including Links and Delays are provided in additional files.
Fig 8

Profiles of F-scores of Links, Delays and Effects for Different Settings for Small Hidden Case.

The x-axis shows the records.

Fig 9

Histogram of Absolute Difference of F-scores of Links and Effects for Small Hidden Case.

Profiles of F-scores of Links, Delays and Effects for Different Settings for Small Hidden Case.

The x-axis shows the records. Next we consider the F-scores under different σ 2. If we repeat the above method of showing the profiles of F-score for different σ 2, we found that the F-score of individual record for different σ 2 show great discrepancies for some records. However, Fig 10 shows the boxplot of Effects F-score for different σ 2 for one segment case, which shows that the overall distribution of F-scores are quite similar. The boxplot for the multiple segment case is similar and is shown in Figure L in S1 File. Therefore, we mainly show the results for Effects F-score for σ 2 = 2 in Table 3 for one segment case and in Table 4 for multiple segments case, and the full results of median performance for small hidden case are shown in Table A in S1 File for one segment case and in Table D in S1 File for multiple segment case.
Fig 10

Boxplot of Effect F-score with Different σ 2 for Small Hidden Case.

Table 3

Median Effects F-scores for One Segment Small Hidden Case with σ 2 = 2.

α = 0.5 α = 1 α = 2 α = 3
pc m st2st3st4st2st3st4st2st3st4st2st3st4
02200.6670.6670.6670.5000.5000.5830.0000.0000.0000.0000.0000.000
501.0001.0001.0001.0001.0001.0000.5830.7330.7330.0000.0000.000
1001.0001.0001.0001.0001.0001.0001.0001.0001.0000.0000.0000.000
2001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
03200.6670.6670.5000.5830.7330.5000.0000.2000.5000.0000.2500.500
500.8330.9000.8001.0001.0001.0000.7330.8000.8000.4500.5360.536
1001.0001.0001.0000.8290.8290.9000.6670.6670.6670.7330.8000.800
2001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
04200.5000.6190.6670.6190.6190.6670.5710.6670.6670.1430.3330.367
500.8570.8570.8570.6670.6670.7500.6670.6670.6670.5000.5710.571
1000.7500.8040.8570.7500.8570.8570.7080.7500.8040.7080.7080.708
2001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
05200.5360.5710.3890.4220.5000.5360.4000.5360.5360.4000.4720.500
500.7640.8000.8440.7270.7500.8440.6330.6640.7270.4000.4000.586
1000.8000.9001.0000.7271.0001.0000.7270.7640.8000.6710.6640.697
2001.0001.0001.0000.9551.0001.0001.0001.0001.0000.9551.0001.000
12200.4000.4000.5000.4000.4000.6500.5360.6670.5000.0000.0000.000
500.8000.8000.8000.6670.7330.7330.0000.0000.0000.0000.0000.000
1001.0000.9000.9000.8000.8000.8000.6670.7330.7330.1430.1670.167
2000.8000.8000.8000.8290.8000.8001.0000.9000.9000.8570.8571.000
13200.5360.6190.6670.5710.5710.6670.5360.6190.5360.0000.0000.310
500.8730.8730.8570.6610.7500.7080.7080.6670.6670.5710.5710.571
1000.8730.9290.9290.8570.8570.8570.7500.7500.8040.5360.5710.619
2001.0001.0001.0000.8570.8570.8570.8570.8570.8570.8570.8570.857
14200.6670.6670.5710.5220.5000.5000.5450.6670.6670.1910.2160.250
500.6970.8440.8440.6000.7080.7500.7270.7330.6670.4720.5230.600
1000.9091.0001.0000.8000.8000.8440.8440.8440.8890.5450.6000.633
2000.9551.0001.0000.8890.8890.8890.9091.0001.0000.8890.8890.889
15200.5000.6000.6000.5030.6670.5500.2650.4000.4440.3640.3820.422
500.7480.8000.8000.6970.7480.7270.6670.6670.6670.4450.5230.481
1000.8830.9090.9090.7690.8330.8710.8330.8330.8330.7140.7210.727
2000.9090.9090.9090.9621.0001.0000.9230.9090.9090.8710.9090.909
22200.3100.3330.3670.0000.1430.6190.2500.2860.3670.0000.2920.367
500.5360.5710.6670.0000.0000.0000.3330.6190.6190.1430.1430.143
1000.5710.6670.6190.5710.5710.6190.6670.6670.6670.4170.4170.452
2000.7080.7080.6670.7500.7500.8040.8570.8570.8570.7500.8570.857
23200.6330.6670.5710.5000.5710.6610.5080.5000.5000.2220.2860.286
500.6670.7080.7080.5730.5220.5830.4440.4720.5000.0000.1110.450
1000.6970.7750.7750.7750.7500.7500.7270.7500.7500.5430.5450.523
2000.8170.8890.8890.7750.7750.7750.8000.8890.8890.8000.8890.844
24200.5450.6330.6000.3960.5450.4530.0910.0910.2910.3210.3640.382
500.6970.7270.7270.5000.5450.5450.3480.2820.3820.4370.4810.481
1000.8330.8330.8330.6330.7480.7480.6060.6060.6060.6410.5860.633
2000.8710.9090.9090.8330.9090.9090.8170.8710.9090.8170.9090.909
25200.5000.6150.5800.4620.5800.5450.4810.5230.4440.3480.4140.400
500.6670.6410.6670.7080.7690.7420.5710.5930.5800.2680.2970.321
1000.8290.8290.8570.7420.7690.7690.5580.5000.5450.4500.5020.571
2000.8570.9230.9230.8620.8900.9230.8570.9230.9230.7850.8570.857
32200.3470.5000.5000.2500.2860.3330.2860.2860.3100.0000.1110.268
500.5710.5360.5710.4720.5710.5710.4000.4220.4440.0000.0000.000
1000.5000.5710.5710.5230.5710.5360.4720.4170.4170.1000.0000.000
2000.6670.6670.6670.6330.6670.6330.7500.7500.7500.7390.7500.750
33200.4220.4720.5000.3820.3650.4220.4000.4000.4000.0000.2000.211
500.5860.5730.6000.3820.4220.4220.5730.6000.6000.4000.4000.422
1000.7210.7270.7270.5000.6000.5500.5310.6330.6670.3970.4530.422
2000.7690.8000.7640.7210.7270.7270.8170.8000.8000.7690.8000.764
34200.4140.4810.5230.3480.3640.4000.3820.4220.4000.3210.2970.279
500.6670.6670.7270.5710.6670.6670.4290.4620.4620.4000.4620.431
1000.7140.6900.6900.7420.7690.7690.5930.6150.6150.4380.4620.413
2000.7850.8170.7850.7690.7690.7690.7140.8330.8330.7690.7690.769
35200.4340.5580.5030.4450.4810.4620.3690.4140.4000.3210.3330.348
500.6460.6670.6670.6250.5930.6410.4710.5710.5360.3880.3880.429
1000.6860.7100.7140.6810.7140.7140.4730.5000.5170.5410.5880.561
2000.7500.8000.8000.8400.8660.8570.8660.8660.8660.8000.8750.866

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively.

Table 4

Median Effects F-scores for Multiple Segments Small Hidden Case with σ 2 = 2.

α = 0.5 α = 1 α = 2 α = 3
pc K m st2st3st4st2st3st4st2st3st4st2st3st4
021201.0001.0000.6670.9001.0000.6670.6670.6670.8330.5001.0001.000
2431.0001.0001.0001.0001.0001.0000.9001.0001.0000.5000.5000.833
4941.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
82011.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
031300.6670.8000.8000.8000.8000.8000.7330.8000.8000.6670.8000.800
2510.8000.8000.8000.7330.8000.8000.6670.6670.6670.7330.8000.900
41000.8290.8000.8001.0001.0001.0000.8330.8000.8000.8000.7330.800
81991.0001.0001.0000.9291.0001.0000.9290.9000.9000.7750.7330.900
041210.5710.6670.6670.7080.7500.6670.6670.6670.8040.5710.6670.619
2480.7500.7500.8570.8570.8570.8570.7500.7500.8040.7500.7500.857
41000.8040.8570.8570.8570.9441.0000.9441.0001.0000.7620.8570.857
82040.8891.0001.0001.0001.0001.0001.0001.0001.0000.8890.9290.929
051250.4950.6190.5710.6330.6330.5710.5500.6330.7500.5450.7390.750
2540.8000.8000.8890.8080.8440.8890.6970.8440.8890.8000.8000.844
41100.7640.8000.8440.8170.8890.9440.7270.8440.8890.8000.8000.889
82050.8440.8440.9440.8440.9090.9090.7390.8000.8000.7750.8000.800
121260.7330.8000.8000.6860.8000.8000.4500.5000.5830.0000.2000.400
2510.8000.8000.8000.8000.8000.8000.3330.3670.4000.1670.3670.367
41090.8000.8000.8000.8000.8000.8000.6670.7330.8000.6670.6670.733
81950.8000.8000.8000.7330.8000.8000.6670.7330.8000.3330.2860.333
131230.7500.7620.6670.6190.6670.6670.6750.8040.8570.6670.5710.667
2480.6670.7080.7080.7500.8570.8570.5710.5710.5710.8040.8570.857
4960.7500.7500.7500.8570.8570.8570.8040.8570.8570.8040.8570.857
82010.7080.7080.7500.7500.7500.7500.7500.8040.8040.6610.7290.762
141290.7750.8890.8000.6670.7500.7500.7270.7390.7080.6670.6670.633
2500.8440.8440.8890.7080.7080.7080.7330.7330.7080.6330.6670.708
4990.8440.8890.8890.6970.7500.7750.6640.7640.8000.7270.8000.800
81910.8000.8890.8440.8000.8890.8440.8000.8000.7640.6640.8000.800
151280.8170.8330.8000.7270.6670.6970.7480.8000.8000.7270.6970.697
2580.8330.8710.9090.8330.8170.8170.8010.9090.9090.8010.9090.909
41050.9090.9090.9090.9090.9090.9090.8390.8770.9090.7850.8330.833
82040.9090.9090.9090.8660.9090.9090.8710.9090.9090.7690.8710.909
221270.4000.6190.6670.5710.6190.6670.3100.4000.5710.3330.4000.400
2540.6670.6670.6670.0000.5710.5710.5710.6190.6670.6670.6670.667
41080.5710.6670.6670.5710.6190.6190.7500.7500.6670.6190.6670.667
82160.6670.6670.6670.6190.6190.6190.7500.7500.7500.7080.7080.667
231210.5860.6190.5710.4720.5710.5710.5000.5710.5000.4440.5000.472
2430.6330.6330.6190.6670.6670.7080.5000.5000.5000.3760.4040.472
4980.6000.5860.5860.5500.6000.6000.4220.4720.4720.5220.5560.556
81810.6970.7500.7080.5000.5710.5710.6670.6670.6190.6260.5000.550
241300.6330.6640.6670.7270.7270.8000.6080.6330.6670.4000.4950.600
2580.6410.6670.6970.7480.8000.8000.6410.6970.7270.7480.7270.727
41020.6410.7270.7270.7210.7480.7480.7180.7270.7270.6650.6670.697
82090.6300.6650.6900.6670.7850.8000.7480.7270.7270.6650.6970.697
251220.6150.6670.6670.6150.5860.5730.4450.5000.6000.2860.3540.422
2450.6410.7140.6900.7140.7690.7690.6150.6650.6650.4370.4620.464
4930.6900.7140.7690.6900.7140.7140.7690.7690.7690.3540.4140.381
81960.5930.6410.6670.6020.6150.6670.7690.7690.7690.4640.5330.558
321230.5000.5710.4520.4720.5710.5710.2500.3330.3330.3430.3100.333
2500.5000.5000.5710.5000.5360.5710.2500.2500.2680.4040.3100.393
41060.2000.2220.2500.5360.5360.5710.4220.3890.4720.4040.2500.250
82020.4220.4440.4440.5000.5000.5000.2000.2860.2860.5360.5000.536
331280.4500.6000.6330.5450.6330.6670.5730.6000.5000.3820.4000.444
2480.5730.5450.5230.5730.6000.6330.4810.4220.4000.4000.4000.400
4980.5860.6000.6000.3960.5230.4730.5450.5730.6000.4720.5730.500
81940.5500.4730.4730.4810.5030.5030.5730.6330.6330.5730.5730.573
341250.5230.5450.5450.5840.5520.5520.5000.5000.5450.3540.4450.495
2530.4640.4730.4730.5000.5230.5450.5360.5580.5800.4290.4620.500
4970.5000.5230.5450.5710.6410.6410.6410.6670.6670.5170.5450.558
81880.5330.5580.5800.4310.4620.5450.5710.5980.6200.4330.4660.481
351230.5170.5930.5930.5330.5710.5930.5980.6150.6150.4500.4660.462
2520.4310.4450.4620.4450.4810.4810.5520.5520.5710.3880.4310.481
41050.5880.6200.6410.5900.6900.7140.6200.6900.7690.5560.6410.615
81990.5880.7320.6900.4810.6200.6460.6460.7320.7600.5330.5560.556

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used.

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used. The results of the small case with one hidden node have no clear trend for either score threshold, α or the number of time points m. We would expect that the performance is better with larger m, but there are some cases where this is not the case. One possible reason is that for this case, the algorithm need to correctly detect the presence of hidden common node, and infer the link(s) between the hidden node and the observed genes, so the performance has more variability. For example, if the algorithm incorrectly predicted that there are no hidden node, then the F-score will surely be 0, as in the true GRN, all links are to/from the hidden node. For α, there are some differences between α = 0.5,1 and 2, but the differences are not great. For α = 3, the algorithm seems to have worse performance when m is small. However, note that when m is large, the F-scores are quite satisfactory for one segment case, and in many settings are over 0.9 or even 1. For multiple segments case, for p = 2,3, the results are a bit worse, especially for α = 3, but can still reach 0.6 in F-score. The results show that the proposed algorithm can adequately detect and estimate the hidden node in a simple setting, and is robust towards slight deviation of gaussianity in the error terms.

Synthetic Small GRN without Hidden Node

In this case, we test the effectiveness of the algorithm when there are in fact no hidden node. And in applying the algorithm, we do not assume the number of hidden nodes is known, but the variance of the error terms is known. The parameters are same as above, which is shown in the column Small GRN of Table 2. For this purpose, we generate some “confusing” cases as follows. For each GRN (p, c, σ 2, α and replicate) in the above cases of small GRN with one hidden node, we use the (incomplete) data of 200 time points, and use CLINDE (with score threshold of 2 using PCor) to infer an GRN, which is definitely wrong, as CLINDE does not handle hidden nodes, and there are no true links among observed genes. If this GRN is non-empty, it is used; otherwise, a small GRN of a node in the middle with p parents, and c−1 children is generated as above, but all genes are labeled as observed. Having obtained the 6,400 GRNs without hidden nodes, the same method as above is used to generate the 25,600 time series data for one segment case, and 51,200 segments for multiple segments case. Similar to the small case with hidden node, the F-scores of Links, Delays and Effects are fairly consistent, so we omit the profiles and histograms here, but they are in Figures C to F in S1 File. Also, the situation is similar for the F-scores with different σ 2, and we show the boxplot in Figure J in S1 File for one segment case, and in Figure M in S1 File for multiple segments case. Therefore, we show the results of Effects for σ 2 = 2 in Table 5 for one segment case, and in Table 6 for multiple segments case, and the full results of median performance is shown in Table B in S1 File for one segment case and in Table E in S1 File for multiple segments case.
Table 5

Median Effects F-scores for One Segment Small Non-Hidden Case with σ 2 = 2.

α = 0.5 α = 1 α = 2 α = 3
pc m st2st3st4st2st3st4st2st3st4st2st3st4
02200.0000.0000.0000.0000.0000.0000.0000.0000.0000.5000.0000.000
501.0000.0000.0001.0000.0000.0000.0000.0000.0001.0001.0000.000
1001.0001.0001.0001.0001.0001.0001.0001.0000.0000.3330.0000.000
2001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
03200.5000.0000.0000.5000.0000.0000.1670.0000.0000.5000.0000.000
500.7330.6190.0000.6670.6670.0000.6670.1670.0000.5830.0000.000
1000.8000.8000.8001.0001.0001.0000.6670.6670.6190.8000.8000.583
2001.0001.0000.9001.0001.0001.0001.0001.0001.0001.0001.0001.000
04200.3330.0000.0000.0000.0000.0000.3330.0000.0000.3330.3330.143
500.8000.6670.1250.6190.3100.0000.5830.4000.1430.4440.4440.367
1000.8570.8290.8000.7500.7500.6670.6670.6670.5230.6670.5360.536
2001.0001.0001.0001.0001.0000.8731.0000.9290.8570.8890.8890.889
05200.3250.0000.0000.2360.0000.0000.3640.0000.0000.3330.2680.222
500.5710.5000.3640.5000.4220.2920.4810.3670.2540.5000.4720.348
1000.7750.7500.6670.6970.6330.5390.6410.4720.3650.5860.5710.400
2000.9160.9160.8040.8570.8450.8000.8730.8730.8570.8990.9090.909
12200.4500.0000.0000.5000.0000.0000.0000.0000.0000.4000.4500.250
501.0001.0000.6670.7330.8000.6670.6670.6670.2000.5830.5000.450
1001.0001.0001.0001.0001.0001.0001.0001.0001.0000.7330.6670.667
2001.0001.0001.0001.0001.0001.0001.0001.0001.0000.9291.0001.000
13200.5000.4220.2860.3330.0000.0000.4500.3330.0000.4220.1670.143
500.8000.7750.6330.7750.6190.3330.5230.4000.3100.5710.5000.500
1000.8570.8570.8571.0001.0000.9290.7330.6670.6670.5710.5710.536
2000.8890.9440.8891.0001.0001.0001.0000.9440.8571.0001.0000.929
14200.2360.0000.0000.2110.2110.0000.4220.1000.1110.3480.4000.268
500.5230.3820.2780.6000.5580.2360.5450.4810.3820.5000.4530.404
1000.8000.7750.7750.8000.8170.7500.7390.6670.5580.6080.6410.593
2000.8890.8730.8570.8890.8890.8890.9551.0000.8890.8730.8990.857
15200.3640.1820.0000.3820.2110.0000.3360.2680.1670.3210.3480.000
500.5520.3820.3210.6200.5000.3640.4380.3640.3640.5520.5450.523
1000.7270.7170.6250.7500.7270.6670.5580.5000.5000.5930.5000.500
2000.8450.9160.6970.9160.8990.8290.8830.9090.7320.8330.8570.829
22200.0000.0000.0000.5830.1670.0000.0000.0000.0000.4000.3100.000
500.6670.5330.4000.7330.6670.6670.7330.5000.4000.5000.5000.333
1000.8290.9290.6671.0001.0000.9000.8290.9290.8000.7080.8000.775
2000.8291.0001.0000.9291.0001.0001.0001.0001.0000.9441.0001.000
23200.3100.0000.0000.4000.2860.0000.4000.3330.0000.3670.2860.000
500.6970.6670.4720.6970.6670.5710.6670.5230.3670.6670.5860.472
1000.8570.8570.8000.8570.8570.8290.7500.7080.6670.6670.5000.500
2000.8990.9090.8570.8730.9160.9440.8730.9440.8890.8890.8890.857
24200.4040.1110.0000.5000.3480.0000.2360.2500.1000.2860.3080.222
500.6410.5860.5360.5230.5230.4220.6000.5230.3100.4720.4720.422
1000.8610.8450.7500.6670.6670.6460.6330.6330.6330.6670.6020.561
2000.9230.9160.9620.8330.8450.8450.8890.8990.8610.8330.8330.833
25200.2260.1600.1440.3100.1740.0000.2360.2250.1430.3330.3330.250
500.6150.4290.3580.5230.5230.4000.5170.5170.4100.5630.4810.333
1000.7660.6830.6490.6670.6650.5930.5390.5630.5630.6460.6250.497
2000.8570.8450.7750.8750.8570.8450.8330.7140.6670.8400.8120.769
32200.3330.0000.0000.0000.0000.0000.3100.0000.0000.2680.0000.000
500.5710.5000.5000.5710.2500.0000.3820.3820.3100.4170.3330.367
1000.6670.8000.7330.6670.6670.6670.6330.5710.5860.5000.4000.333
2000.8571.0000.8570.8001.0000.8290.8730.8830.8570.8450.8570.829
33200.2500.0000.0000.3210.0000.0000.3330.1430.0000.2360.2110.000
500.6330.5450.3890.5000.5230.4220.4720.4440.4440.4140.4310.400
1000.8570.8890.8570.7390.7080.6670.6410.5860.4720.5580.5730.481
2000.9091.0000.8990.8890.8890.8890.8730.8890.8890.8170.8290.775
34200.3880.1000.0000.2680.2220.0000.2360.2860.2020.3210.2360.222
500.5000.5000.4620.5330.5520.4530.5860.5000.4720.4000.3480.297
1000.7270.6970.6080.7390.7270.6330.6410.6080.5450.6080.5230.500
2000.8280.8170.8000.9090.9090.8830.8570.8890.8570.7750.7500.708
35200.2670.1360.0000.2760.2760.0000.3210.2110.1680.2970.1830.121
500.5000.3640.2580.4710.3430.2860.4620.3530.2860.4660.4370.354
1000.7100.6670.6270.5880.5860.5440.6200.5170.5170.5560.5170.429
2000.8120.8570.8120.8660.8120.6670.8240.8380.7180.8000.8060.750

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively.

Table 6

Median Effects F-scores for Multiple Segments Small Non-Hidden Case with σ 2 = 2.

α = 0.5 α = 1 α = 2 α = 3
pc K m st2st3st4st2st3st4st2st3st4st2st3st4
021281.0000.0000.0000.3330.0000.0001.0000.0000.0000.3330.0000.000
2501.0001.0000.0001.0000.0000.0001.0000.8330.0001.0001.0000.833
41101.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
82111.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
031200.3330.0000.0000.0000.0000.0000.0000.0000.0000.0000.0000.000
2430.6190.4170.3330.6670.0000.0000.6670.0000.0000.6670.6670.583
4931.0000.8000.8001.0001.0000.8001.0001.0000.8001.0001.0001.000
81971.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
041200.3640.0000.0000.3430.0000.0000.0000.0000.0000.2860.0000.000
2410.6670.3330.0000.6670.4000.0000.4500.3330.0000.4220.4440.000
4891.0000.8570.6670.8290.6670.6670.6670.5710.5230.7620.7330.500
81971.0001.0001.0001.0000.8730.8570.8570.8570.8570.8890.9440.857
051270.5000.2000.0000.2500.0000.0000.4000.0000.0000.3100.2360.000
2510.6410.5000.2680.6670.3330.1820.5580.5000.2220.5450.3820.222
4960.8000.6670.4870.8000.6670.5580.7750.6670.6670.7850.6970.619
81941.0000.9620.8730.8330.8570.8170.8890.8570.8570.8570.8990.844
121210.6670.6670.0000.3330.0000.0000.0000.0000.0000.4000.0000.000
2441.0001.0000.8000.9001.0000.6670.6670.5000.4500.5000.5000.333
4941.0001.0001.0001.0001.0001.0001.0001.0001.0000.7330.6670.667
81961.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0000.800
131270.6190.4720.2920.5000.3100.0000.5360.2000.0000.4440.0000.000
2540.8000.8000.8000.8000.8000.6670.6000.6190.5360.5710.5710.500
41000.8570.8570.8290.8571.0000.8570.8290.8000.8000.8570.8570.733
82110.8890.8730.8731.0001.0001.0000.8570.9290.8571.0001.0000.857
141290.5860.3480.0000.5000.3820.0000.5000.2930.0910.3820.2110.000
2550.7500.7320.5730.6670.5000.5000.5000.5450.4810.4970.3210.222
4970.7750.7750.7500.7750.6970.6670.8570.7750.6970.6970.5860.481
82020.8570.8570.8570.8450.8890.8730.9090.8990.8890.8570.8130.817
151210.3480.1670.0000.2680.0770.0000.2860.2220.0000.2760.2110.000
2440.5170.4000.3010.4450.4000.2860.4450.3100.3670.5000.4530.388
4890.6670.6200.5930.6670.5580.4810.5580.5170.4000.6410.5710.458
81870.8450.7270.7270.8730.8820.6900.8330.7320.6460.8000.8330.764
221200.4000.0000.0000.4000.0000.0000.3100.0000.0000.3330.0000.000
2440.6670.5000.3670.7330.6670.5360.6670.5830.0000.6670.5000.500
4960.8290.8000.6671.0001.0000.8000.7500.7330.6670.8571.0000.900
81931.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.0001.000
231220.5000.4000.0000.4000.0000.0000.3100.0000.0000.4220.2680.310
2480.7500.6670.4720.8000.7750.5830.5360.4000.3670.5000.4440.444
41030.8290.8570.8000.8570.8730.8290.7080.7080.6670.6970.7080.633
81970.8890.8890.8570.8570.8890.8730.8570.8000.8000.8000.8000.739
241280.5000.2680.0000.3820.3670.3670.4220.3480.2500.4370.3430.200
2550.7390.5360.4810.6670.5730.5330.6410.4810.4000.5170.5000.462
41020.8170.8000.7390.7640.7480.6710.7140.6670.6080.5670.5330.481
82000.8730.8890.8990.8890.8330.8330.8730.8990.8170.7850.7330.733
251280.3640.2580.1480.4000.2870.1670.3360.1820.2110.3300.1670.138
2580.6000.4660.4290.6110.5000.4810.5020.4020.3360.5170.5170.445
41130.7140.7500.5980.7850.7740.6460.6860.6020.5280.5800.5390.558
82160.8750.8660.7680.8170.8570.8170.8280.8170.7750.7060.7060.667
321200.1430.0000.0000.1110.0000.0000.1110.0000.0000.1110.0000.000
2400.6670.5000.4000.4000.3330.0000.1250.2860.0000.3330.3100.167
4910.8570.9290.6670.7080.6670.5000.7080.5000.4720.5710.5500.500
81981.0001.0001.0000.8730.8570.6670.8730.8000.7080.8000.8000.800
331280.4220.4000.2500.5580.4000.2860.4000.0000.0000.2670.2220.000
2490.7270.6670.4000.5710.5230.4440.4440.4440.2500.3820.3480.333
4940.7080.8000.6670.6970.7500.6670.6670.7270.6670.5580.5000.422
81880.8451.0000.8890.8730.8570.8170.8570.8890.7500.7390.7210.667
341220.3330.2220.1110.3480.2500.0000.3330.2080.0000.2860.1110.000
2470.5710.5000.4000.5860.5170.4370.5440.4810.4000.4000.4290.388
41050.7690.6670.5940.7480.7270.6670.7210.7640.6080.6150.5440.466
82070.8330.8330.8000.9090.9090.8330.8890.8890.8330.7690.7690.760
351240.3210.2500.1330.3010.2670.0710.4000.3080.1480.2430.1940.183
2470.5000.4710.3810.4530.3210.2970.3750.4290.2860.4290.3480.297
4930.6900.5710.5410.6150.4850.4020.7170.5780.5130.5330.5020.450
81970.8450.8450.7530.7780.6830.5880.8120.8450.8010.7750.6260.602

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used.

The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. The medians are taken over the 20 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used. For this case, when m is larger, the F-score is usually higher, although there are some exceptions, especially for α = 3 and p = 0. Also, small score thresholds usually work well, especially when m is small. The F-score is often 0 for small m with high score threshold, which means the score threshold is too stringent. When m is larger, using a more stringent score threshold also gives good F-score, but is only occasionally better than small score threshold. The F-scores can reach over 0.9 or even 1 in many settings, and over 0.6 in all settings. Similar to the above case, the performance does not show great difference for different α, though small α is usually slightly better. The results show that the proposed algorithm works satisfactorily even when there are no hidden node in a simple setting.

Synthetic Large GRN with More than One Hidden Node

The above two cases are for small GRN, and where the variance of the error terms is known. We also test the more realistic case of larger GRN with more than one hidden node (but the number is unknown), and that the variance of the error terms is unknown, but estimated by the algorithm. For a network with n observed genes, we would generate hidden nodes. For n observed genes and n hidden nodes, a maximum of M 0 parents for observed genes, maximum of τ 0 as delay, we generate a GRN with the structure shown in Fig 11 as follows:
Fig 11

Large Synthetic GRN.

Each hidden node has up to 3 distinct parents, and up to 5 distinct children.

Choose the number of parents (all distinct) for the hidden nodes: first for each i ∈ {0,1,2,3}, assign nodes to have i parents; for each of the remaining nodes, randomly choose i from {0,1,2,3} as the number of parents. Similarly choose the number of children (all distinct and distinct from the parents) for the hidden nodes: first for each i ∈ {2,3,4,5}, assign nodes to have i children; for each of the remaining nodes, randomly choose i from {2,3,4,5} as the number of children. Let N and N be the total number of parents and children of hidden nodes respectively. Let N = n−N −N (n should be large enough relative to n so that ) For 1 ≤ i ≤ N , randomly choose p from {0,…,M 0}, and randomly link p parents from {1,…,n} to node i. For , randomly choose p from {0,…,M 0}, and randomly link p parents from {1,…,n−N } to node i. For the hidden nodes, link from distinct parents from {N + 1,…,N + N }, respecting the number of parents chosen above. Similarly link to distinct children for the hidden nodes from {N + N + 1,…,n}, respecting the number of children chosen above. For each link L, randomly set the coefficient as a = ρ z , where ρ is uniformly chosen from {−1,1} and z is uniformly chosen from (0.5,1.5). For each link L, uniformly choose τ from {1,…,τ 0}. Since the GRN may have cycles, scale the coefficients to make the GRN stable. Lastly, randomly permute the indices of the observed genes {1,…,n}.

Large Synthetic GRN.

Each hidden node has up to 3 distinct parents, and up to 5 distinct children. The parameters that we have tested are listed in column Large GRN of Table 2. For each setting of n, σ 2, α, we randomly generate 40 replicates as described above, for a total of 1,600 GRNs. Then for each GRN, for the one segment case, we generate expression data with m = 800 time points, and take prefixes to get different number of time points. So there are a total of 9,600 time series. And for the multiple segments case, we generate 32 segments for each replicate, for a total of 51,200 segments. We test more time points because now the number of genes is larger than the above two small cases. Fig 12 shows the profiles of F-scores of Links, Delays and Effects for different settings for large case, which shows that the three are very consistent, so we omit the histograms of absolute differences here, but they are in Figures G to I in S1 File. In addition, Fig 13 shows the profiles of Effects F-scores for different σ 2 and Figure K in S1 File shows the corresponding boxplot for one segment case, and Figure N in S1 File for multiple segments case. Both show that for large case, the performance is very consistent for different σ 2. Therefore, in the following we show the results of Effects F-scores for σ 2 = 2 in Table 7 for one segment case, and in Table 8 for multiple segments case, where we show the performances of complete: CLINDE on the complete data (i.e. with also the expression of the “hidden” nodes), hidden: the proposed algorithm on the incomplete data (i.e. only the expression of observed genes), and hiddenCL: CLINDE on the incomplete data. Note that the complete case is mainly for comparison purpose, as in real-life, the hidden variables cannot be observed. It is unsurprising if the complete case has better performance, as CLINDE has more data in this case, so it is the comparison between hidden and hiddenCL that is of main interest here. The full results of median performance are shown in Table L in S1 File for one segment case and Table F in S1 File for multiple segments case. In Tables 7 and 8, we also show the ratio of F-score of hidden to that of complete, in percentages.
Fig 12

Profiles of F-scores of Links, Delays and Effects for Different Settings for Large Case.

The x-axis shows the records.

Fig 13

Profiles of Effects F-scores for Different σ 2 for Different Settings for Large Case.

The x-axis shows the records.

Table 7

Median Effects F-scores for One Segment Large Case with σ 2 = 2.

complete (C)hidden (H)H/ChiddenCL
n n h α m st2st3st4st2st3st4st2st3st4st2st3st4
5050.5200.1230.0880.0210.0970.0550.00978.7%63.0%45.0%0.0810.0460.000
500.4130.3970.2790.3230.2690.17178.3%67.9%61.4%0.2940.2660.180
1000.6590.6670.6080.5320.5170.44680.7%77.6%73.5%0.4760.4920.438
2000.7880.8470.8150.6560.7150.66783.3%84.4%81.8%0.5840.6510.622
4000.8470.9330.9330.7370.8080.81687.1%86.6%87.5%0.6160.7030.718
8000.8500.9660.9790.7690.8780.89390.5%90.8%91.3%0.6000.6980.724
5051200.1180.0780.0390.1070.0620.02290.9%78.8%56.0%0.0780.0540.021
500.3990.3740.2740.3100.2820.19177.7%75.5%69.7%0.2880.2750.183
1000.6420.6540.5960.4930.5000.43276.8%76.5%72.5%0.4810.5000.438
2000.7760.8550.8170.6520.6940.66784.0%81.2%81.6%0.5920.6550.632
4000.8340.9310.9340.7240.8120.80686.8%87.2%86.2%0.6160.7110.719
8000.8470.9650.9750.7540.8670.88189.0%89.9%90.4%0.6030.7050.730
5052200.1360.1200.0660.1160.1040.05485.2%86.4%81.7%0.0900.0730.040
500.3850.3840.3000.2980.2840.21077.5%73.8%70.0%0.2820.2630.196
1000.6070.6260.5750.4530.4570.40974.6%73.1%71.2%0.4500.4640.425
2000.7650.8310.8060.6010.6460.62678.5%77.7%77.6%0.5670.6320.626
4000.8200.9320.9330.6630.7600.76280.9%81.6%81.7%0.6000.6960.707
8000.8330.9590.9760.7310.8500.87087.8%88.6%89.1%0.5950.6860.719
5053200.1600.1460.1030.1310.1170.08482.0%80.2%81.5%0.1070.0860.058
500.3700.3410.3100.3040.2600.23582.3%76.3%75.7%0.2630.2460.220
1000.5460.5450.5050.4230.4090.37977.3%75.1%75.1%0.3960.3990.377
2000.6760.7260.7350.5300.5720.57378.4%78.9%77.9%0.5050.5580.569
4000.7460.8550.8820.6010.7030.71980.6%82.3%81.5%0.5540.6480.688
8000.7680.9060.9510.6900.8290.87289.9%91.4%91.6%0.5580.6680.706
100100.5200.0710.0600.0160.0750.0470.011106.6%78.0%68.9%0.0520.0380.011
500.3470.3640.2810.2590.2520.18174.8%69.1%64.3%0.2500.2510.196
1000.5970.6540.6020.4530.4950.43575.8%75.6%72.2%0.4240.4830.441
2000.7200.8330.8190.5950.6870.67682.6%82.5%82.6%0.5330.6260.625
4000.7810.9250.9330.6740.7980.81386.3%86.4%87.2%0.5740.6890.718
8000.7730.9480.9770.7000.8650.89290.5%91.2%91.3%0.5620.6890.723
100101200.0750.0590.0220.0780.0480.016104.0%81.9%71.0%0.0510.0370.012
500.3480.3570.2880.2650.2520.18776.3%70.7%65.0%0.2550.2510.194
1000.5930.6450.5920.4540.4830.44376.6%74.9%74.7%0.4310.4770.432
2000.7350.8380.8190.5960.6850.66981.1%81.8%81.6%0.5450.6270.628
4000.7810.9270.9340.6690.8010.81585.6%86.4%87.3%0.5670.6830.703
8000.7850.9530.9800.7020.8530.88889.4%89.6%90.5%0.5570.6840.721
100102200.1060.0950.0610.1060.0840.04199.3%88.1%68.2%0.0720.0620.031
500.3500.3660.3150.2550.2590.21872.6%70.9%69.3%0.2440.2510.212
1000.5630.6180.5870.4140.4540.42273.5%73.4%71.9%0.4120.4560.429
2000.7020.8140.8170.5420.6410.63077.2%78.7%77.1%0.5130.6010.619
4000.7610.9130.9370.6280.7600.78482.6%83.2%83.7%0.5410.6740.705
8000.7710.9470.9810.6860.8560.88789.0%90.4%90.5%0.5390.6750.717
100103200.1320.1260.1050.1200.1110.08791.4%88.1%82.4%0.0880.0790.061
500.3180.3120.2910.2560.2410.21880.7%77.3%75.0%0.2240.2110.188
1000.4700.4790.4770.3560.3750.35975.8%78.3%75.2%0.3410.3550.353
2000.5860.6660.6970.4590.5190.53978.4%78.0%77.3%0.4410.5010.531
4000.6530.7900.8490.5140.6360.68378.8%80.5%80.5%0.4840.5970.651
8000.6770.8580.9270.6050.7760.84389.4%90.4%90.9%0.4840.6180.688

The medians are taken over the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage.

Table 8

Median Effects F-scores for Multiple Segments Large Case with σ 2 = 2.

complete (C)hidden (H)H/ChiddenCL
n n h α K m st2st3st4st2st3st4st2st3st4st2st3st4
5050.51300.2130.1750.0940.1630.1190.06376.8%67.9%67.2%0.1420.1200.060
2590.4490.4530.3410.3390.3070.20675.5%67.8%60.3%0.3340.3020.220
41000.6220.6330.5580.4880.4670.40078.4%73.7%71.7%0.4560.4740.407
81960.7700.8160.7680.6230.6470.61480.9%79.3%80.0%0.5890.6140.585
164000.8670.9220.9050.7330.7820.77584.6%84.8%85.7%0.6310.7010.706
328120.8870.9680.9680.7540.8460.86585.1%87.4%89.4%0.6330.6970.722
50511300.1980.1750.0870.1660.1340.05883.7%76.3%65.9%0.1500.1270.051
2590.4550.4350.3340.3130.3090.23168.8%71.1%69.2%0.3290.3060.234
41000.6010.6130.5380.4810.4640.38879.9%75.8%72.1%0.4580.4610.386
81960.7740.8070.7590.6450.6600.61483.3%81.8%80.8%0.5840.6110.590
164000.8460.9160.9010.7340.7960.77886.7%87.0%86.4%0.6450.6940.697
328120.8730.9630.9690.7650.8480.85487.7%88.1%88.2%0.6310.7070.717
50521300.1990.1670.0960.1460.1180.05873.6%70.6%60.3%0.1390.1140.045
2590.4280.4170.3330.3310.2960.24177.5%71.0%72.4%0.3170.2940.233
41000.6000.6210.5470.4790.4670.40379.8%75.2%73.7%0.4560.4630.407
81960.7770.8090.7790.6190.6580.61479.6%81.4%78.8%0.5760.6120.600
164000.8510.9250.9140.7140.7820.76983.9%84.5%84.2%0.6250.6910.704
328120.8760.9660.9720.7360.8220.83184.0%85.1%85.5%0.6250.7060.725
50531300.1840.1520.1020.1550.1220.07484.6%80.2%72.5%0.1300.1040.062
2590.4090.3860.3080.3000.2780.21373.4%71.9%69.4%0.2930.2850.210
41000.5810.5950.5310.4440.4450.38576.4%74.8%72.5%0.4240.4380.393
81960.7440.7920.7540.6190.6390.60283.2%80.7%79.8%0.5670.6110.582
164000.8440.9070.8960.7070.7810.77283.8%86.1%86.2%0.6290.6980.695
328120.8770.9640.9640.7450.8210.84385.0%85.1%87.4%0.6280.7030.716
100100.51210.0710.0580.0210.0690.0510.01697.9%89.0%75.1%0.0510.0440.011
2480.3070.2990.2330.2320.2130.16075.5%71.3%68.8%0.2220.2130.158
4920.5300.5670.5180.3980.4200.36475.0%74.1%70.2%0.3890.4160.374
81880.7190.7890.7580.5740.6250.60679.7%79.3%80.0%0.5310.5940.581
163870.8080.9100.9020.6780.7720.76983.9%84.9%85.2%0.5950.6860.694
328110.8310.9600.9740.7200.8440.86086.7%87.9%88.3%0.6020.7030.724
1001011210.0880.0760.0300.0850.0650.02195.8%85.1%71.8%0.0600.0570.020
2480.3230.3350.2510.2320.2360.16271.8%70.5%64.6%0.2220.2230.158
4920.5500.5770.5200.4250.4340.36877.2%75.3%70.7%0.3990.4260.367
81880.7250.7920.7610.5780.6470.61179.7%81.7%80.4%0.5390.6000.580
163870.8070.9100.9070.6800.7700.77484.3%84.6%85.4%0.6030.6850.692
328110.8420.9590.9710.7280.8530.85986.5%89.0%88.5%0.6070.6980.721
1001021210.0830.0670.0360.0860.0550.029103.2%81.2%80.3%0.0590.0450.023
2480.2960.2950.2490.2210.2140.16474.6%72.6%65.8%0.2180.2050.155
4920.5290.5590.5180.4040.4220.38276.3%75.5%73.7%0.3740.4090.377
81880.7110.7800.7610.5780.6260.60281.4%80.3%79.1%0.5290.5860.577
163870.8060.9140.9060.6690.7550.75983.1%82.6%83.7%0.5880.6740.688
328110.8360.9590.9770.7150.8390.85785.4%87.6%87.7%0.5910.6880.716
1001031210.0720.0670.0440.0750.0580.029103.9%87.1%66.7%0.0480.0440.026
2480.2650.2500.2120.1970.1870.14474.2%74.9%67.9%0.1890.1770.141
4920.4880.5210.4910.3610.3710.33973.9%71.1%69.0%0.3480.3720.347
81880.6850.7420.7430.5440.5930.57779.4%79.9%77.6%0.4970.5540.562
163870.7780.8850.8910.6410.7380.74182.4%83.4%83.1%0.5650.6600.685
328110.8170.9490.9680.6930.8100.84084.8%85.4%86.8%0.5780.6910.717

The medians are taken over the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage.

Profiles of F-scores of Links, Delays and Effects for Different Settings for Large Case.

The x-axis shows the records.

Profiles of Effects F-scores for Different σ 2 for Different Settings for Large Case.

The x-axis shows the records. The medians are taken over the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage. The medians are taken over the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage. Generally, the F-score is better for larger m, however, note that even with m = 800 or K = 32 on the complete data, CLINDE cannot infer the whole GRN, because the GRNs are not small. For each of complete, hidden and hiddenCL, for α = 0.5,1 and 2, the F-scores are quite similar, though the F-scores are usually worse for α = 3 for one segment case; for multiple segments, the results are similar for α = 0.5,1,2, and 3. This shows the robustness of all three methods towards slight deviation from gaussianity for the error terms, though for larger deviation, the difference in performance is more noticeable. As for the score threshold, when m is small, smaller score threshold is better, and for larger m, larger score threshold is better. When comparing between complete, hidden and hiddenCL, we see that in almost all settings, complete is better than hidden, which in turn is better than hiddenCL. The exceptions are all for m ≤ 100 or K ≤ 4. Note that hidden, using incomplete data, is usually able to achieve 70% to 80% of the performance of complete, and can get up to about 90% with large m. We have also performed one-sided Wilcoxon signed-rank test to test whether the median Effects F-score of hidden is better than hiddenCL. The p-values for σ 2 = 2 are shown in Table 9 for one segment case and in Table 10 for multiple segments case. And the p-values for other σ 2 are shown in Table C in S1 File for one segment case and in Table G in S1 File for multiple segments case. We see that for most settings, the p-values are very small, but occasionally the p-values are a bit larger for m ≤ 200 or K ≤ 4.
Table 9

P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the One Segment Large Case with σ 2 = 2.

n α m st2st3st4
500.5201.85082E-094.27927E-045.91254E-02
501.22187E-061.97777E-027.05012E-01
1001.25890E-074.13188E-034.55023E-01
2009.09495E-136.36646E-125.21868E-09
4009.09495E-139.09495E-139.09495E-13
8009.09495E-139.09495E-139.09495E-13
501203.64207E-081.03680E-041.36338E-02
501.76024E-031.84770E-026.32721E-01
1004.21831E-041.47409E-019.18072E-01
2003.00133E-114.70440E-081.13534E-05
4009.09495E-139.09495E-131.81899E-12
8009.09495E-139.09495E-139.09495E-13
502203.91992E-091.34587E-082.29484E-06
501.78595E-041.03773E-024.49752E-03
1009.21563E-026.12502E-018.15900E-01
2003.38990E-065.88649E-031.47409E-01
4001.72804E-119.09495E-132.30102E-10
8009.09495E-139.09495E-139.09495E-13
503204.06544E-109.09495E-131.92938E-07
501.18116E-083.94010E-078.41174E-04
1002.84941E-042.22241E-031.62600E-01
2001.02952E-042.32674E-031.94996E-01
4001.27329E-116.36646E-121.18116E-08
8009.09495E-139.09495E-139.09495E-13
1000.5203.00133E-115.60030E-045.22645E-02
503.95898E-031.84770E-029.97212E-01
1002.14964E-071.78595E-049.97084E-01
2009.09495E-131.81899E-125.82077E-10
4009.09495E-139.09495E-139.09495E-13
8009.09495E-139.09495E-139.09495E-13
1001201.81899E-123.23034E-081.06658E-01
506.77883E-031.18525E-019.10652E-01
1001.33686E-062.78850E-034.39254E-01
2001.27329E-119.09495E-131.74014E-08
4009.09495E-139.09495E-139.09495E-13
8009.09495E-139.09495E-139.09495E-13
1002202.27374E-113.91083E-114.10137E-08
503.13413E-052.84941E-041.31949E-03
1009.62277E-031.18525E-019.99385E-01
2006.00448E-093.03602E-076.83742E-04
4009.09495E-139.09495E-132.72848E-12
8009.09495E-139.09495E-131.85347E-08
1003209.09495E-136.36646E-129.09495E-13
502.51475E-098.22183E-104.10137E-08
1005.05048E-051.22426E-055.76843E-03
2001.09601E-043.23034E-083.76574E-03
4001.26774E-076.36646E-114.87489E-10
8009.09495E-139.09495E-139.09495E-13

The tests are based on the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively.

Table 10

P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the Multiple Segments Large Case with σ 2 = 2.

n α K m st2st3st4
500.51301.38532E-033.32451E-029.21645E-02
2591.97233E-026.95747E-017.61896E-01
41004.79873E-061.13695E-019.78903E-01
81968.00355E-115.81749E-085.81073E-07
164009.09495E-139.09495E-139.09495E-13
328129.09495E-139.09495E-139.09495E-13
5011306.77883E-036.05550E-031.17538E-01
2592.17906E-012.61704E-013.52344E-01
41006.10535E-062.18740E-021.56701E-01
81961.27329E-112.53194E-085.83590E-04
164009.09495E-139.09495E-132.72848E-12
328129.09495E-139.09495E-139.09495E-13
5021301.24866E-021.35190E-014.48704E-02
2596.00796E-034.18378E-013.05251E-01
41007.48913E-053.85003E-026.27698E-01
81961.00044E-103.56872E-071.68219E-04
164001.85347E-089.09495E-139.09495E-12
328129.09495E-139.09495E-139.09495E-13
5031301.01875E-061.28805E-041.59819E-03
2591.56904E-012.10106E-015.43996E-02
41003.19229E-045.09871E-037.05012E-01
81969.09495E-121.62785E-062.24322E-04
164009.09495E-139.09495E-139.09495E-13
328129.09495E-139.09495E-139.09495E-13
1000.51212.79215E-102.24432E-042.35524E-01
2481.05239E-051.60394E-026.91068E-01
4922.53916E-059.10447E-049.95689E-01
81889.09495E-136.36646E-121.93759E-07
163879.09495E-139.09495E-139.09495E-13
328119.09495E-139.09495E-139.09495E-13
10011212.30102E-107.64159E-074.52594E-03
2483.37733E-042.04941E-051.91553E-01
4929.13360E-084.69068E-033.42508E-01
81889.09495E-131.81899E-122.46640E-06
163879.09495E-139.09495E-139.09495E-13
328119.09495E-139.09495E-139.09495E-13
10021214.54747E-125.82077E-109.55109E-04
2482.78850E-031.13837E-034.20773E-02
4925.18348E-087.73253E-064.44501E-01
81889.09495E-139.09495E-133.04778E-06
163879.09495E-139.09495E-139.09495E-13
328119.09495E-139.09495E-139.09495E-13
10031211.81899E-125.37816E-075.09440E-07
2485.27871E-074.12623E-053.63203E-03
4923.99203E-043.57299E-018.94343E-01
81881.00044E-101.27329E-112.92082E-07
163879.09495E-139.09495E-139.09495E-13
328119.09495E-139.09495E-131.85347E-08

The tests are based on the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used.

The tests are based on the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. The tests are based on the 40 replicates. st2, st3 and st4 are for score thresholds of 2, 3 and 4 respectively. K is the number of segments used. m is the total number of time points of the segments used. The results show the effectiveness of the proposed algorithm in detecting and estimating hidden nodes in large GRN, without knowing the number of hidden nodes and the variance of error terms.

Heterogeneous Variance in Error Terms in Synthetic Large GRN

In previous sections, the error terms of all genes have the same constant variance in the synthetic data. This is an admittedly restrictive assumption. In this section, we test our algorithm on heterogeneous variance in the error terms for the synthetic large GRNs generated in the previous section, to test how robust HCC-CLINDE is towards violation of the assumption of constant variance in error terms. We simulate data as above, but instead of a constant, the variance of error terms for gene i is generated as , where z ∼ N(2,δ 2), and we test different values of δ 2:0.05, 0.1, 0.2, 0.5, 0.7, 0.9, 1. Figs 14 and 15 show the median Effects F-scores against different δ 2 for different α and different number of time points for one segment case with score threshold st = 2 for n = 50 and n = 100 respectively; while Figs 16 and 17 show the results for multiple segment case with st = 2 for n = 50 and n = 100 respectively. The full table of median F-scores (with different st, and also Links and Delays performances including Recall, Precision and F-score) are given in Tables H and I in S1 File for one segment case and multiple segments case respectively.
Fig 14

Median Effects F-scores for n = 50 with Different δ 2 for One Segment Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Fig 15

Median Effects F-scores for n = 100 with Different δ 2 for One Segment Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Fig 16

Median Effects F-scores for n = 50 with Different δ 2 for Multiple Segments Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Fig 17

Median Effects F-scores for n = 100 with Different δ 2 for Multiple Segments Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Median Effects F-scores for n = 50 with Different δ 2 for One Segment Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Median Effects F-scores for n = 100 with Different δ 2 for One Segment Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Median Effects F-scores for n = 50 with Different δ 2 for Multiple Segments Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2.

Median Effects F-scores for n = 100 with Different δ 2 for Multiple Segments Large Case.

complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. st used is 2. First, note that the median Effects F-scores for complete (which is CLINDE on the complete data) and hiddenCL (which is CLINDE on the incomplete data) are mostly stable for different values of δ 2, which is reasonable because CLINDE ignores hidden common cause, and makes no assumption on the variances of the error terms. We have also performed one-sided Wilcoxon signed-rank test to test whether the median Effects F-score of hidden is better than hiddenCL. The p-values are shown in Table 11 for st = 2 for both one segment and multiple segments cases, and the p-values for st = 3 and st = 4 are in Tables J and K in S1 File for one segment and multiple segments cases respectively. With sufficient time points (or segments), for small δ 2, the F-scores of hidden (which is HCC-CLINDE on the incomplete data) is smaller than complete but larger than hiddenCL, which is qualitatively the same as the constant variance case in previous section. Up to δ 2 = 0.2, the p-values are less than 0.001 for 400 or 800 time points, and 16 and 32 segments. As δ 2 increases, the F-scores decrease, and usually at δ 2 = 0.5, the performance of hidden would be close to hiddenCL and that hidden is no longer significantly better than hiddenCL, indicating that HCC-CLINDE cannot effectively recover the hidden common causes and has no advantage over ignoring hidden common causes. However, note that when δ 2 = 0.2, the variances of most genes are between 1 and 3, and when δ 2 = 0.5, most variances are between 0.1 and 4, which are moderately heterogeneous. Therefore, the results show that HCC-CLINDE can tolerate slight violation of the assumption of constant variances in the error terms of the genes.
Table 11

P-values of one-sided Wilcoxon signed-rank test on whether the medians Effects F-scores of hidden is better than hiddenCL for the Heterogeneous Variance Large Case.

n α δ 2 nps = 200nps = 400nps = 800 K = 8 K = 16 K = 32
500.50.05 2.51475E-09 9.09495E-13 9.09495E-13 5.82077E-10 9.09495E-13 1.81899E-12
0.1 1.46175E-06 1.72804E-11 1.53705E-10 2.92082E-07 3.91992E-09 6.36646E-11
0.2 2.13304E-04 1.77279E-05 7.73253E-06 5.38597E-05 1.22187E-06 4.53190E-05
0.5 2.68777E-02 3.51204E-02 1.44876E-011.35193E-017.83223E-014.30922E-01
0.72.25867E-01 2.04007E-02 9.38085E-02 1.66783E-011.84100E-013.48342E-01
0.97.88755E-024.34017E-018.39846E-01 2.78850E-03 8.50441E-02 7.70093E-01
1.03.72302E-019.06192E-018.22954E-011.84100E-018.60265E-019.08685E-01
5010.05 1.53705E-10 2.72848E-12 1.81899E-12 1.26774E-07 9.09495E-13 9.09495E-12
0.1 1.85082E-09 3.00133E-11 1.72804E-11 3.17419E-06 1.00044E-10 4.06544E-10
0.2 3.41121E-02 1.31954E-05 2.01109E-04 5.53261E-04 3.57201E-04 1.58393E-04
0.5 4.91390E-02 2.96950E-02 1.41598E-01 4.35158E-02 2.59897E-02 6.03618E-02
0.7 3.62532E-02 2.90787E-015.02651E-011.91553E-012.66446E-017.89894E-01
0.95.47211E-012.90403E-018.94343E-011.63447E-011.44329E-013.82076E-01
1.01.73106E-017.05012E-018.08678E-014.30922E-019.73459E-017.82094E-01
5020.05 3.56872E-07 4.54747E-12 9.09495E-13 1.97442E-08 3.00133E-11 9.09495E-12
0.1 1.68219E-04 1.74174E-06 2.86163E-08 8.23745E-06 7.30443E-08 3.91992E-09
0.2 5.43599E-03 3.17871E-06 6.39177E-07 1.63188E-04 5.81749E-08 3.61960E-06
0.58.51795E-01 1.66629E-02 8.04618E-01 2.86594E-02 4.25285E-012.95777E-01
0.76.07236E-013.97724E-019.67926E-012.72400E-016.47656E-019.75042E-01
0.9 8.93482E-02 4.91650E-014.19823E-016.12502E-015.18552E-018.57140E-01
1.01.19160E-019.80935E-019.98489E-012.59264E-014.92047E-019.99624E-01
5030.05 2.22241E-03 1.03501E-08 1.81899E-12 4.61341E-08 1.72804E-11 2.27374E-11
0.1 6.83742E-04 2.84941E-04 1.24601E-10 3.35921E-05 5.81749E-08 2.51475E-09
0.25.52734E-01 6.16046E-05 1.75435E-05 2.01897E-04 8.35792E-06 7.98702E-05
0.5 1.00421E-03 4.09055E-02 2.23464E-01 1.79761E-03 1.03308E-01 8.68181E-02
0.7 3.37733E-04 1.23964E-01 2.65413E-02 2.17906E-011.38298E-01 5.90745E-02
0.9 1.78022E-02 4.70862E-013.52344E-015.18552E-012.63607E-017.40736E-01
1.0 1.66100E-02 2.42260E-019.90377E-015.44977E-015.50242E-019.77451E-01
1000.50.05 1.81899E-12 9.09495E-13 9.09495E-13 6.93035E-10 9.09495E-13 9.09495E-13
0.1 1.72804E-11 9.09495E-13 2.72848E-12 1.53705E-10 4.54747E-12 9.09495E-13
0.2 5.63730E-06 6.39177E-07 2.16005E-09 2.12223E-03 2.26382E-06 2.26382E-06
0.51.44329E-016.47656E-019.79599E-017.36393E-019.85053E-019.99799E-01
0.75.13254E-019.90737E-019.94901E-014.81448E-019.91421E-019.98546E-01
0.99.68908E-019.34221E-019.99999E-017.40736E-019.96814E-011.00000E+00
1.09.61500E-019.96208E-019.99842E-019.62402E-019.99476E-019.99997E-01
10010.05 1.81899E-12 9.09495E-13 9.09495E-13 2.92221E-09 9.09495E-13 9.09495E-13
0.1 5.21868E-09 9.09495E-12 6.36646E-11 2.63926E-07 8.22183E-10 1.53705E-10
0.2 3.56872E-07 1.85082E-09 2.92082E-07 3.22983E-07 1.53150E-08 4.07638E-06
0.5 1.44250E-02 8.71768E-02 1.73583E-01 1.66629E-02 5.59250E-02 1.18525E-01
0.76.17588E-016.37723E-019.93221E-012.94988E-018.12230E-019.63531E-01
0.99.23105E-019.06192E-019.99988E-012.42260E-019.55396E-019.99643E-01
1.01.41292E-019.94463E-019.99993E-01 3.96647E-02 7.76206E-019.99811E-01
10020.05 2.27374E-11 2.72848E-12 9.09495E-13 9.09495E-13 9.09495E-13 1.81899E-12
0.1 3.75407E-06 1.00044E-10 3.00133E-11 5.20249E-06 6.36646E-11 1.81899E-12
0.2 1.07730E-02 1.42158E-05 2.23736E-08 2.13304E-04 5.63730E-06 1.74014E-08
0.54.55023E-014.34017E-019.72496E-012.38104E-015.60747E-019.08441E-01
0.7 9.26339E-03 9.59143E-017.45043E-01 3.32894E-03 2.21867E-019.45600E-01
0.95.02651E-019.55396E-019.94463E-017.89894E-018.88969E-019.95689E-01
1.04.39254E-019.10652E-019.99503E-014.28791E-018.70430E-019.99999E-01
10030.05 1.08313E-03 1.33686E-06 9.09495E-13 2.14964E-07 1.24601E-10 9.09495E-13
0.1 1.39181E-02 6.10535E-06 1.81899E-12 1.57007E-07 4.06544E-10 2.27374E-11
0.2 6.00796E-03 1.31954E-05 3.45537E-06 1.68219E-04 1.90650E-05 2.39727E-04
0.51.22987E-016.52587E-017.23150E-01 5.20007E-02 1.11031E-019.35002E-01
0.7 4.12623E-05 7.82094E-019.83961E-015.50242E-019.21125E-019.59143E-01
0.9 5.76843E-03 9.37642E-019.99993E-018.43097E-019.99554E-019.99989E-01
1.01.84100E-017.53545E-019.99811E-018.29838E-019.81523E-011.00000E+00

The tests are based on the 40 replicates. st used is 2. K is the number of segments in multiple segment case. nps is the number of time points in single segment case.

The tests are based on the 40 replicates. st used is 2. K is the number of segments in multiple segment case. nps is the number of time points in single segment case.

Small YEASTRACT Subnetworks with Real Data

Preprocessing of Subnetworks

We accessed YEASTRACT [55] (http://www.yeastract.com/formfindregulators.php) on 7th Feb, 2015 to get the regulating TFs of a list of 149 TFs. We chose “DNA binding and expression evidence” (which gives more confidence than having only binding evidence or expression evidence alone) and queried twice, once with “TF acting as activator” and once with “TF acting as inhibitor” to try to get the regulatory effects (positive or negative) of the regulatory relationships. The obtained 392 links involve only 129 TFs, and we used “ORF List ⇔ Gene List” utility of YEASTRACT (http://www.yeastract.com/formorftogene.php) to convert the gene names into ORF id’s, and all these 129 id’s appear in the yeast cell cycle [53] data. However, the GRN is too large for the limited data, so we have chosen 22 subnetworks with sizes and constituent TFs shown in Table 12. For each subnetwork, a TF (which has children in the subnetwork) is chosen to be the hidden node. As the delays in the links are not known, and for a TF pair, some links may be both positive and negative in the YEASTRACT network, so we focus on the performance on Links.
Table 12

YEASTRACT Subnetworks.

sn n n L Hidden TFOther TFs
sn145MBP1ASH1, HCM1, SWI4
sn2511GLN3DAL80, GAT1, GCN4, UGA3
sn365ADR1IME1, MSN4, PIP2, STE12, USV1
sn465ASH1ACE2, MBP1, SWI5, TOS8, YHP1
sn566YAP6CBF1, CIN5, MOT3, PDR1, TUP1
sn6610MSN2ADR1, FHL1, NRG1, SOK2, USV1
sn7612DAL80GAT1, GLN3, STE12, SUM1, TEC1
sn876ACE2ASH1, FKH2, GAT1, HMS2, INO4, SFL1
sn977MET4ABF1, HAP4, MET28, MET32, SFP1, TYE7
sn1077MSN4ADR1, HAL9, RAP1, ROX1, RPN4, SOK2
sn1177UME6GAT1, GSM1, LEU3, MSN2, OAF1, SIP4
sn1278STE12MIG2, MSN2, PDR1, PDR3, SOK2, YAP1
sn1379CIN5FLO8, IXR1, NRG1, XBP1, YAP1, YAP6
sn14711MCM1MET32, STE12, SWI4, SWI5, TYE7, YAP3
sn15711RAP1FKH1, FKH2, MCM1, SFP1, STE12, SWI5
sn16714FLO8CIN5, HCM1, HMS1, STE12, TEC1, TOS8
sn17912PDR1HAP4, MET28, PDR3, RPN4, SFL1, SWI4, YAP5, YAP6
sn18916RPN4HSF1, MSN2, MSN4, PDR1, PDR3, PUT3, REB1, YAP1
sn191017STE12CBF1, HAP4, MET4, MSN2, PDR1, RAP1, ROX1, SOK2, YAP1
sn201113ABF1DAL81, ECM22, HAP1, HMS2, MET4, MGA1, REB1, RTG3, STP1, SUM1
sn211223STE12ASH1, FLO8, OAF1, RAP1, RFX1, SFP1, SKO1, SOK2, TEC1, TOS8, XBP1
sn221338ROX1FHL1, HAP1, HAP4, HMS1, IXR1, MSN2, MSN4, SKN7, SKO1, STE12, XBP1, YAP1

sn is the subnetwork. n is the number of TFs in the subnetwork, n is the number of links in the subnetwork. The hidden TF is the one with expression hidden in incomplete setting of the experiments.

sn is the subnetwork. n is the number of TFs in the subnetwork, n is the number of links in the subnetwork. The hidden TF is the one with expression hidden in incomplete setting of the experiments.

Preprocessing of Expression Data

We downloaded the yeast cell cycle [53] data from http://genome-www.stanford.edu/cellcycle/. The expression data contains 4 time series: alpha, cdc15, cdc28 and elu, with different lengths and time points, as shown in the second column of Table 1. Since the time steps of the 4 time series are not all the same, for each TF, we perform spline interpolation (using the spline() function in R) to the time points shown in the third column of Table 1. Also, some TFs in some series are entirely missing (this seems more often in cdc15), and we fill in with zero. For other missing values, we rely on the spline interpolation to fill in the value. After this, for each subnetwork we use the converted ORF id of each TF to retrieve expression data of the TFs on the 4 time series, to obtain one set labeled complete which contains the expression data of all TFs of the subnetwork, and another labeled incomplete which contains the expression data of all but the chosen TF of the subnetwork. Therefore, for each subnetwork, we have 8 expression datasets. The Links F-scores of CLINDE with complete data, our proposed algorithm (with normalization of data, with unknown σ 2 and unknown number of hidden nodes) on incomplete data, and CLINDE on incomplete data on the subnetworks are shown in Table 13, where we set the maximum delay τ 0 to be 4, and use the score threshold of 1 as the number of time points is not large. We run the algorithms (ours and CLINDE) separately on one time series and used the 4 time series together (all), because both CLINDE and our proposed algorithm can handle multiple segments of time series. We show which segment(s) have the best performance.
Table 13

Links F-scores for YEASTRACT Subnetworks.

complete (C)hidden (H)H/ChiddenCL
sn n n L bestwhichbestwhichbestbestwhich
sn145 0.889 cdc280.444cdc28, elu50.0%0.364alpha, cdc28
sn2511 0.476 cdc150.324all, alpha68.1%0.267alpha
sn3650.286cdc28 0.364 cdc15 127.3% 0.000
sn465 0.333 alpha 0.333 cdc15100.0%0.000
sn566 0.308 alpha 0.308 cdc28100.0%0.000
sn6610 0.646 cdc280.545elu84.4%0.471elu
sn76120.526cdc15 0.556 elu 105.6% 0.455cdc15
sn8760.286cdc28 0.333 elu 116.7% 0.000
sn977 0.429 cdc280.375cdc1587.5%0.154cdc28
sn10770.293cdc28 0.353 cdc15 120.6% 0.000
sn1177 0.353 cdc28 0.353 alpha100.0%0.133cdc28
sn12780.421elu0.444cdc15 105.6% 0.500 cdc15
sn13790.305all 0.375 alpha 122.8% 0.083all
sn147110.381cdc28 0.421 cdc28 110.5% 0.174cdc15
sn157110.320elu 0.385 all 120.2% 0.245cdc15, cdc28
sn167140.441elu 0.545 cdc15 123.6% 0.417elu
sn179120.296elu 0.364 alpha 122.7% 0.174elu
sn189160.214cdc15 0.367 cdc15 171.3% 0.154cdc15
sn1910170.253cdc28 0.333 cdc15 131.9% 0.253cdc28
sn2011130.282cdc15 0.435 alpha 154.3% 0.000
sn211223 0.386 elu0.326elu84.5%0.295elu
sn2213380.190elu 0.309 cdc15 162.2% 0.136elu

sn is the subnetwork. The score threshold is 1. n is the number of TFs in the subnetwork, n is the number of links in the subnetwork. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage. best is the best of the four segments, and which shows the best segment (all is using all 4 segments). The maximum delay τ 0 used is 4.

sn is the subnetwork. The score threshold is 1. n is the number of TFs in the subnetwork, n is the number of links in the subnetwork. complete is using CLINDE on the complete data. hidden is our proposed algorithm on the incomplete data. hiddenCL is CLINDE on the incomplete data. H/C is the F-score of hidden divided by that of complete and shown as percentage. best is the best of the four segments, and which shows the best segment (all is using all 4 segments). The maximum delay τ 0 used is 4. First, as the 4 time series segments are from different experimental conditions, so it is possible that different genes have better responses in different segments. Consequently, using all 4 segments (all) may not give the best results, even though the total number of time points is larger. Second, our proposed algorithm on incomplete data has better F-scores than CLINDE on complete data in 14 out of 22 subnetworks, with 3 ties. There are two possible reasons. One is that after estimating the initial GRN, the subsequent steps make some assumptions on the structure of the GRN around the hidden common cause(s). This may help the GRN inference, especially in case of limited data. The other is that after estimating the hidden common cause(s), the re-estimation of the links around the hidden common cause(s) works only on a subnetwork. However, we expect that as the amount of (quality) data increases, the situation would be more like the synthetic large case, where CLINDE on complete data has better performance, but our algorithm is close to it. Third, our proposed algorithm on incomplete data outperforms CLINDE on incomplete data in 21 out of 22 subnetworks. This is as expected, as CLINDE is unaware of the presence of hidden node(s), so there could be misleading links (as illustrated in Fig 1). In short, due to limited real expression data, we cannot draw very strong conclusions, but the results indicate the potential of our proposed algorithm in recovering hidden common cause(s) using real data, when the error variance and the number of hidden node(s) is unknown.

Discussions

In this section, we discuss possible extensions to the basic algorithm introduced above to relax some assumptions made.

Variance of Error Terms

In the proposed algorithm, we assume that the variance of the error terms is a constant. This assumption is used in detection of genes with hidden common cause, and in calculation of correlation threshold in clustering. This assumption can be relaxed, for example to assume that the error terms of some genes have variance and some have variance . After inferring an initial GRN and calculating the empirical variances of the error terms, instead of using the median as estimate of the true variance, we may cluster the empirical variances and use the centers of the two largest clusters as estimates of and . Having obtained the estimates (or assumed given if the number of observed genes is too small to allow good estimation), those observed genes with empirical variances too far away from the estimates are predicted to have hidden common cause. The variance of the error terms is also used in calculating the threshold of absolute correlation in clustering the genes having the same hidden common cause. If we assume more than one possible variance, we may use the larger one to calculate a conservative threshold. Alternatively, for each observed gene, we may choose the estimate closer to the empirical variance as its true variance, and calculate the threshold accordingly. At this point, it is unclear which method gives better clustering, and further study is needed. However, it is difficult to handle a very general and spread-out distribution of the variances of the error terms. If the distribution is unknown and we desire to estimate it from the empirical variances, a large number of observed genes is needed for meaningful estimation, but this increases the difficulty of inference of the initial GRN substantially because of the increased number of genes. Even if we assume the distribution is known or well estimated, given an empirical variance, we need to decide whether this is as expected, which is essentially a statistical test, which is more difficult for less concentrated distribution.

Discrete Data

The proposed algorithm handles continuous data, here we discuss the possibility of extending it to discrete time series data. For this purpose, a few parts would need to be adapted. Firstly, the data model would be different. Instead of a linear combination of its parents’ values (with time delay) plus an additive error, the value of a node would depend on its parents through a conditional distribution. In addition, we assume that given each configuration of the parents, the conditional distribution has most probability (e.g. 1−p ) concentrated on a value, and the remaining p is spread over other values, i.e. the value of the node is roughly a function of its parents, with a “noise level” of p . Secondly, the inference of initial GRN needs adaptation. This is simple because similar to the PC algorithm [29], CLINDE can use any type of independence and conditional independence test suitable for the data. For example, χ 2 tests could be used instead of the correlation test. Thirdly, we need to predict the nodes with hidden common cause based on the initial GRN. Given an initial GRN, we could estimate the maximum likelihood conditional distribution of each node from the data, and analogous to variance of the error terms, the empirical “noise level” could be calculated and compared with the expected level to predict whether that node has hidden common cause. If we assume a constant “noise level” for all the nodes, it could be given (for small number of nodes) or estimated from empirical “noise levels” (for large number of nodes). Fourthly, for the clustering of nodes having the same hidden common cause, since two nodes having the same parent would be associated, so it is only necessary to determine an appropriate threshold for the association measure based on the “noise levels”. Lastly, instead of using Singular Value Decomposition to estimate hidden common cause(s), a different method has to be used. For example, Structural-EM could be used as in [51], and the method in [52] could be used for the related problem of determining the number of states for the hidden common cause(s).

Relax Structural Assumptions

Nodes with More than One Hidden Common Cause

In the basic algorithm, we assume that if a gene has a hidden parent, it has no other parents. Here we consider the possibility that a gene has two or more hidden parents (both hidden common causes). This should pose little difficulty for predicting the genes with hidden common cause(s), as they would have the wrong parent(s) in the initial GRN, and consequently the empirical variance of they error terms are likely different from expected. On the other hand, the clustering and estimation of hidden common causes would need some adaptation. For clustering, the determination of the correlation threshold is not as straightforward. One simple strategy is to use a fixed conservative threshold, and hope that genes without the same hidden common cause(s) would not be close enough to have high correlation. And for the estimation of the hidden common cause(s), we may still apply SVD, but use more singular vectors corresponding to the largest singular values. The main difficulty is that a way is needed to determine how many singular vectors to take, i.e. how many hidden common causes the set of genes have. One possible strategy is to successively take more singular vectors, until the drop in residual error is small. But this is a sort of model selection problem, which is not straightforward. Therefore, more study is needed to determine whether the strategy would work well.

Multiple Layers of Hidden Nodes

The current model and algorithms assumes that any hidden common cause does not have other hidden common cause as parents, i.e. the hidden common causes are not connected directly to each other. This simplifying assumption may be restrictive in some cases, where multiple layers of hidden common causes may exist and may have meaningful interpretation. One possible strategy is to first infer possible hidden common cause(s) of observed genes, then treat them as observed (because our algorithm also estimates the time series of the predicted hidden common cause(s)), and repeat the process to see if further hidden common cause(s) could be inferred. However, at this stage, it is unclear whether this strategy would work well, because the estimated time series of the hidden common cause(s) may not be accurate enough. Moreover, it is unclear that whether it is even feasible to infer rich structures among hidden nodes without prior knowledge or assumptions, given only limited data of observed nodes. We leave these issues as future work.

Relative Error Model

The proposed algorithm assumes a constant variance for the additive error terms, and the variance could be known or estimated from empirical variances. Alternatively, we may assume a model where the variance of the error terms is a constant proportion of the variance of the gene. This could be easily handled by (centering and) normalizing the input expression data such that each gene has a variance of one (except those with constant expression). The proportion of variance of error terms to the variance of the gene would remain unchanged, but now it is also the variance of the error terms. This normalization does not affect the inference of the initial GRN, as CLINDE uses correlation and conditional correlation tests, which are unaffected by centering and scaling.

Delays with Distribution

In our current model, we assume the delay between two genes is deterministic (but unknown and is to be found). While this assumption makes the model and algorithm simpler, in reality, due to the stochastic operations of the cell, it is more realistic to model the delays as random variable, e.g. with exponential distribution as in [32] and [33]. We leave this as future work.

Not Using Time Series Data

The basic algorithm uses time series data, here we briefly consider the possibility of using non-time series data. To begin with, without time series data, it would be impossible to estimate the delays in the links, but it may still be possible to infer the directions of the links, just as in the PC algorithm [29]. However, without time series data, inferring the directions of the links is much more difficult, especially when we allow the presence of cycles in the causal network. Also, the directions of some links may still remain undetermined given the data, because both directions are consistent with the data. The clustering and estimation of hidden common cause, on the other hand, does not pose great difficulty when using non-time series data, assuming the initial GRN is well estimated.

Conclusion

To summarize, we have developed an algorithm to recover the causal GRN with delays in the regulatory links from time series expression data, where a small but unknown number of nodes are hidden, i.e. without expression data. We have tested our algorithm on 3 types of synthetic data: small GRN with one hidden node, small GRN with no hidden node, and large GRN with a small but unknown number of hidden nodes. Results on synthetic data show that our algorithm can effectively recover the causal GRN. We have also demonstrated our algorithm on small subnetworks of YEASTRACT with real expression data, and the results show the potential of our algorithm to recover hidden nodes using real data.

Containing various supplementary figures and tables, listed as follows.

Figure A, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Small Hidden Case. Figure B, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Small Hidden Case. Figure C, Profiles of F-scores of Links, Delays and Effects for Different Settings for One Segment Small Non-Hidden Case. Figure D, Histogram of Absolute Differences of F-scores of Links and Effects for One Segment Small Non-Hidden Case. Figure E, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Small Non-Hidden Case. Figure F, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Small Non-Hidden Case. Figure G, Histogram of Absolute Differences of F-scores of Links and Effects for One Segment Large Case. Figure H, Histogram of Absolute Differences of F-scores of Delays and Effects for One Segment Large Case. Figure I, Histogram of Absolute Differences of F-scores of Links and Delays for One Segment Large Case. Figure J, Boxplot of Effects F-score with Different e2 is σ 2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table B, Full Results of Median Performance for One Segment Small GRN without Hidden Node. e2 is σ 2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table C, P-values of one-sided Wilcoxon signed-rank test on whether the medians e2 is σ 2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table E, Full Results of Median Performance for Multiple Segments Small GRN without Hidden Node. e2 is σ 2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 20 replicates. Table F, Full Results of Median Performance for Multiple Segments Large GRN with More than One Hidden Node with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes n , e2 is σ 2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 40 replicates. Table G, P-values of one-sided Wilcoxon signed-rank test on whether the medians δ 2 with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes n , e2 is σ 2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. d2 is δ 2. Each median is taken over 40 replicates. Table I, Full Results of Median Performance for Multiple Segments Large GRN with More than One Hidden Node with n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes n , e2 is σ 2, alpha is α, nsegs is number of segments K, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. d2 is δ 2. Each median is taken over 40 replicates. Table J, P-values of one-sided Wilcoxon signed-rank test on whether the medians n = 50 and n = 100. ng is the number of observed genes n, nh is the number of hidden nodes n , e2 is σ 2, alpha is α, nps is number of time points m, st is the score threshold. For the data column, hidden means using our proposed algorithm on the incomplete data, complete means using CLINDE on the complete data (i.e. all nodes are not hidden), hiddenCL means using CLINDE on the incomplete data. l2., d2. and e2. are Links, Delays and Effects respectively. r is Recall, p is Precision and f is F-score. Each median is taken over 40 replicates. (ZIP) Click here for additional data file.
  30 in total

1.  Autoinhibition with transcriptional delay: a simple mechanism for the zebrafish somitogenesis oscillator.

Authors:  Julian Lewis
Journal:  Curr Biol       Date:  2003-08-19       Impact factor: 10.834

2.  Inferring Time-Delayed Causal Gene Network Using Time-Series Expression Data.

Authors:  Leung-Yau Lo; Kwong-Sak Leung; Kin-Hong Lee
Journal:  IEEE/ACM Trans Comput Biol Bioinform       Date:  2015 Sep-Oct       Impact factor: 3.710

3.  Uncovering gene regulatory networks from time-series microarray data with variational Bayesian structural expectation maximization.

Authors:  Isabel Tienda Luna; Yufei Huang; Yufang Yin; Diego P Ruiz Padillo; M Carmen Carrion Perez
Journal:  EURASIP J Bioinform Syst Biol       Date:  2007

4.  Learning gene regulatory networks from gene expression measurements using non-parametric molecular kinetics.

Authors:  Tarmo Aijö; Harri Lähdesmäki
Journal:  Bioinformatics       Date:  2009-08-25       Impact factor: 6.937

5.  Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis.

Authors:  Zheng Li; Ping Li; Arun Krishnan; Jingdong Liu
Journal:  Bioinformatics       Date:  2011-08-04       Impact factor: 6.937

6.  Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.

Authors:  P T Spellman; G Sherlock; M Q Zhang; V R Iyer; K Anders; M B Eisen; P O Brown; D Botstein; B Futcher
Journal:  Mol Biol Cell       Date:  1998-12       Impact factor: 4.138

7.  Delay-induced degrade-and-fire oscillations in small genetic circuits.

Authors:  William Mather; Matthew R Bennett; Jeff Hasty; Lev S Tsimring
Journal:  Phys Rev Lett       Date:  2009-02-13       Impact factor: 9.161

8.  Inference of gene regulatory networks and compound mode of action from time course gene expression profiles.

Authors:  Mukesh Bansal; Giusy Della Gatta; Diego di Bernardo
Journal:  Bioinformatics       Date:  2006-01-17       Impact factor: 6.937

9.  How to infer gene networks from expression profiles.

Authors:  Mukesh Bansal; Vincenzo Belcastro; Alberto Ambesi-Impiombato; Diego di Bernardo
Journal:  Mol Syst Biol       Date:  2007-02-13       Impact factor: 11.429

10.  The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae.

Authors:  Miguel Cacho Teixeira; Pedro Tiago Monteiro; Joana Fernandes Guerreiro; Joana Pinho Gonçalves; Nuno Pereira Mira; Sandra Costa dos Santos; Tânia Rodrigues Cabrito; Margarida Palma; Catarina Costa; Alexandre Paulo Francisco; Sara Cordeiro Madeira; Arlindo Limede Oliveira; Ana Teresa Freitas; Isabel Sá-Correia
Journal:  Nucleic Acids Res       Date:  2013-10-28       Impact factor: 16.971

View more
  2 in total

1.  Detecting Causality by Combined Use of Multiple Methods: Climate and Brain Examples.

Authors:  Yoshito Hirata; José M Amigó; Yoshiya Matsuzaka; Ryo Yokota; Hajime Mushiake; Kazuyuki Aihara
Journal:  PLoS One       Date:  2016-07-05       Impact factor: 3.240

2.  Identifying time-delayed gene regulatory networks via an evolvable hierarchical recurrent neural network.

Authors:  Mina Moradi Kordmahalleh; Mohammad Gorji Sefidmazgi; Scott H Harrison; Abdollah Homaifar
Journal:  BioData Min       Date:  2017-08-03       Impact factor: 2.522

  2 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.