Literature DB >> 29937735

Identification of Boolean Network Models From Time Series Data Incorporating Prior Knowledge.

Thomas Leifeld¹, Zhihua Zhang¹, Ping Zhang¹.

Abstract

Motivation: Mathematical models take an important place in science and engineering. A model can help scientists to explain dynamic behavior of a system and to understand the functionality of system components. Since length of a time series and number of replicates is limited by the cost of experiments, Boolean networks as a structurally simple and parameter-free logical model for gene regulatory networks have attracted interests of many scientists. In order to fit into the biological contexts and to lower the data requirements, biological prior knowledge is taken into consideration during the inference procedure. In the literature, the existing identification approaches can only deal with a subset of possible types of prior knowledge.
Results: We propose a new approach to identify Boolean networks from time series data incorporating prior knowledge, such as partial network structure, canalizing property, positive and negative unateness. Using vector form of Boolean variables and applying a generalized matrix multiplication called the semi-tensor product (STP), each Boolean function can be equivalently converted into a matrix expression. Based on this, the identification problem is reformulated as an integer linear programming problem to reveal the system matrix of Boolean model in a computationally efficient way, whose dynamics are consistent with the important dynamics captured in the data. By using prior knowledge the number of candidate functions can be reduced during the inference. Hence, identification incorporating prior knowledge is especially suitable for the case of small size time series data and data without sufficient stimuli. The proposed approach is illustrated with the help of a biological model of the network of oxidative stress response. Conclusions: The combination of efficient reformulation of the identification problem with the possibility to incorporate various types of prior knowledge enables the application of computational model inference to systems with limited amount of time series data. The general applicability of this methodological approach makes it suitable for a variety of biological systems and of general interest for biological and medical research.

Entities: Chemical Disease Gene Species

Keywords: Boolean networks; identification; network inference; prior knowledge; time series data

Year: 2018 PMID： 29937735 PMCID： PMC6002699 DOI： 10.3389/fphys.2018.00695

Source DB: PubMed Journal: Front Physiol ISSN： 1664-042X Impact factor: 4.566

1. Introduction

Boolean networks (BNs) are discrete-time systems, whose variables can take only two possible values (i.e., 0 and 1). Since Stuart Kaufman firstly introduced BNs in Kauffman (1969) for qualitative description of gene regulatory interactions, BNs have attracted great attention from many scientists and several results have been proposed, for instance, analysis (Albert and Barabási, 2000) and control (Fauré et al., 2006). An overview can be found in Wang et al. (2012) and a database for established models and compatible tools has been introduced (Naldi et al., 2015). Mathematical models are important to explain dynamic behavior of a system and to understand the functionality of system components (Grieb et al., 2015) and can help scientists to design model-based targeted therapy and diagnosis (Fumia and Martins, 2013). Hence, the inference of models capturing the relevant behavior of the system is an important topic. The inference can be based on the connection of known biochemical reactions, like BN model for the yeast cell cycle in Davidich and Bornholdt (2008), or on experimental data, if the latter is the case it is also called the identification problem. One of the first approaches to identify a BN was REVEAL which is based on mutual information (Liang et al., 1998). In Akutsu et al. (1999) a similar but less complex approach is presented. Both cannot handle errors in the dataset which was solved in Lähdesmäki et al. (2003). The modeled quantities are not Boolean in the experimental data and need to be binarized first. For the binarization several approaches can be found in the literature ranging from mixture model based clustering (Zhou et al., 2003) to more complex methods where the significance of a jump in the time series is estimated in Hopfensitz et al. (2012). A comparison of some identification and binarization approaches and their combinations can be found in Berestovsky and Nakhleh (2013). Most identification approaches are based on previously binarized data, but there also exist approaches directly based on continuous data (e.g., Karlebach and Shamir, 2012). In Higa et al. (2011) the data is considered as given constraint and the set of systems fulfilling the constraints is searched. This approach was then further improved by reducing the sensitivity to noise in Ouyang et al. (2014). An example of recent research is the identification of Boolean models for transient dynamics after perturbations from time course data with answer set programming (Ostrowski et al., 2016). A BN can simply be extended to a Boolean control network (BCN) by considering manipulated external stimuli as control signal of the network. Recently, a powerful tool called semi-tensor product (STP) of matrices has been proposed in Cheng (2001), which can convert the dynamics of BCNs into a model where all information of the dynamics and the structure of the BCN is contained in two matrices (Cheng et al., 2011a). Using the STP based matrix description of BCN several approaches for identifying BCN have been proposed (Cheng and Zhao, 2011; Fornasini and Valcher, 2014; Zhang et al., 2017a). However, in general, in order to identify the dynamical model of a BCN from its input and output data, a huge number of data is required (Cheng and Zhao, 2011; Cheng et al., 2011b). Though, in practice, data size is limited by the cost of experiments (Geier et al., 2007). In order to reduce the search space and improve the accuracy of the model, the benefit of biological prior knowledge should be taken into consideration. The prior knowledge is used in different ways either by introducing additional constraints in the optimization (Breindl et al., 2013), or reducing the number of parameters in the optimization (Cheng and Zhao, 2011). In Dorier et al. (2016) and Terfve et al. (2012) genetic algorithms are used to handle the complexity problem of large networks while satisfying prior knowledge network graphs as constraints. However, these approaches to handle prior knowledge are not compatible and the advantages of different types of prior knowledge can not be combined. In the approach proposed in this paper, all different types of prior knowledge can be utilized simultaneously and it can additionally handle hypotheses for interactions, which could be used for researcher bias free distinction between alternative hypotheses. Furthermore existing approaches can not handle the case that at some time instances some measurement values are missing, which cannot be avoided in practice due to the limitation of measuring techniques like mass spectrometry-based proteomics. In this paper, we consider the identification problem of BCNs utilizing biological prior knowledge. A part of the results was presented at the 56th IEEE Conference on Decision and Control in Melbourne (Zhang et al., 2017b). However, the BCN model considered in Zhang et al. (2017b) contains a general output equation. By applying prediction error method (PEM), a high-dimensional BCN (i.e., 2 × 2) cannot be avoided. Different from that, although the handling of unmeasurable processes is considered in this paper, the proposed approach leads to a low-dimensional matrix for PEM. Besides, more prior biological knowledge is considered in the paper, like potential interactions, known attractors and limit cycles. Moreover, it is discussed how to deal with alternative hypotheses for interactions and missing measurement points. The main contributions of this paper are as follows: A suitable way to handle the prior knowledge such as known network graph, hypotheses for interactions, canalizing and unateness properties or attractor is introduced. For this purpose the BCN is described by two matrices with unknown parameters as entries. If possible, some parameters are inferred directly. Otherwise, relationships between the parameters are set up. An approach to deal with the identification of BCNs, in particular, from noisy measurements and missing data points is proposed. The identification problem of BCNs is formulated as a nonlinear pseudo-Boolean optimization, which can be equivalently transformed into a linear binary optimization problem and then solved efficiently. The remainder of the paper is organized as follows. Section 2 introduces some fundamental definitions and notations. In Section 3, the identification problem of BCNs addressed in this paper will be formulated. Section 4 introduces a way to utilize prior knowledge in identification procedure. The formulation of identification problem of BCNs as an integer linear programming problem is derived and an example is given in Section 5 to illustrate the approach. Finally, a short discussion on the advantages and limitations of the proposed approach is given in Section 6.

2. Preliminaries

In this part, we list some necessary notations, which will be used in the subsequent sections. ¬, ∧ and ∨ denote the logical negation (not), conjunction (and) and disjunction (or), respectively. and . , where denotes the k-th column of the identity matrix I. For a vector v ∈ ℝ, its j-th entry is denoted by [v], j = 1, 2, ⋯, m. An n × t matrix L is called a logical matrix, if , where i1, i2, ⋯, i ∈ {1, 2, ⋯, n}, and we express L briefly as L = δ[i1i2 ⋯ i]. Denote the set of n × t logical matrices by . Col(M) denotes the i-th column of the matrix M. , where the superscript T denotes the transpose. The concept of the semi-tensor product of matrices (STP) has been introduced by Cheng et al. (2011a). The STP of two matrices A ∈ ℝ and B ∈ ℝ is defined as where ⊗ is the Kronecker product and l = lcm{n, p} is the least common multiple of n and p. The following property of the STP will be used in the subsequent sections. Lemma 1. Let . So the order of two vectors which are multiplied can be altered by multiplying a suitable matrix from the left, this is also called the pseudo-commutativity of the STP. In the following parts the symbol ⋉ will be omitted.

3. Problem formulation

System identification is the determination of a model describing the dynamic behavior of a system based on measured data and known perturbations. In the context of Boolean modeling it is assumed that the transient behavior of the system can be qualitatively described by a finite number of Boolean states and that the interaction of these states can be described by Boolean functions. The perturbations are inputs to the system and cause transient behavior of the interacting states in the system. A measured time series of inputs and states form together the data basis for the identification. Depending on the system which is to be modeled, the states might represent the activity of genes or the abundance of proteins and the perturbations could be a stress like heat or oxygen or a chemical substance. In the following the identification process will be formulated as mathematical optimization problem. Therefore the mathematical model of a BCN needs to be defined first. A Boolean control network (BCN) can be described by the following equations (Cheng and Qi, 2010): where , are, respectively, the state vector, input vector at time t, f are logic functions. At the discrete time instances t the state variables are updated synchronously according to the logic functions f. As shown in Cheng and Qi (2010), a vector form of Boolean variable X, i = 1, 2, ⋯, n can be simply expressed as Let . According to Cheng and Qi (2010), (2) can be equivalently represented in a vector form: where are logical matrices. Multiplying all Equations in (4) together, there is where is a logical matrix and . A polynomial with k variables {θ1, θ2, ⋯, θ} is called multi-linear polynomial, if its degree in each variable is at most 1 (Alon et al., 1991). So, a multi-linear polynomial can be generally expressed as where for and the set has a cardinality of at least 2, i.e., . Generally, the identification problem of BCNs can be described as reconstruction of Boolean functions f, i = 1, 2, ⋯, n that explain the experimental data as well as possible. Because of equivalent representation of a Boolean function by a logical matrix, the identification problem is reformulated as searching for logical matrices based on the input and measurement state data. Note that any logical matrix in can be expressed by multi-linear polynomials in a binary parameter vector θ of dimension a · 2. For example, any logical matrix in can be expressed by a binary parameter vector as where the superscript T denotes the transpose. In this way, each realization of the binary parameter vector corresponds to a unique logical matrix. It is straightforward to equivalently convert this logical matrix into readable logical equations. Based on this, the objective of the paper is to find a binary parameter vector θ, such that dynamic behavior of the BCN (5) is consistent with the important dynamics captured in the observed input-state data.

4. Incorporation of prior knowledge

In this section, we shall show how to utilize known network graph, potential interactions, canalizing and unateness properties and attractors in the identification procedure.

4.1. Known or potential interactions

Often some or all interaction partners are known in a biological system which is subject of identification. This knowledge can come from databases or can be constructed based knowledge about the underlying biochemical reactions. In some cases a known signaling network is to be complemented and different hypothesis for potential interactions shall be evaluated. If all interaction partners and the direction of the interactions are known, the underlying directed network graph of the BN is known. In graph theory, a directed graph can be denoted by , where is a finite set of nodes and is a finite set of edges (Bollobas, 2012). If , then there is an edge from v → v. According to Cheng et al. (2011a), a BCN can be represented by a directed graph, where each gene is considered as a node. If there is an edge from X → X, then X is affected by X. Assume that a directed graph for a BCN is known. Then we have the following result. Lemma 2. If the node . Proof. As the node x is affected by w nodes, then the Boolean function can be represented in a vector form as where the matrix S is a logical matrix of dimension 2 × 2. Recall that the logical matrix S is a matrix containing only columns belonging to Δ2 (Cheng et al., 2011a). Hence, 2 binary parameters are enough for the description of the logical matrix S. An example is given below to express logical matrices of a BCN with a known network graph with the help of binary parameters. Example 1. Consider a BCN as follows. where the network graph of the BCN is shown in Figure 1 (Cheng and Zhao, 2011). According to Cheng and Qi (2010), the algebraic form of the BCN is obtained,

Figure 1

Network graph.

where the logical matrices can be expressed by the binary parameter vector in the following form: Network graph. Potential interactions can be treated in the same way as known interactions as long as all of them could potentially be simultaneously true. If there are two alternative hypotheses and the question is which fits better to the data, then this can be done by introducing a constraint on the parameters θ. Example 2. Assume that X1 is influenced either by X2 or by U1, this could be ensured by imposing the constraint

4.2. Canalizing boolean functions

The concept of “canalizing” values in Boolean functions was introduced in developmental biology in 1940s (Waddington, 1942). The idea is, that one input is dominant and if it takes a certain value it determines the output. After that, in order to explain the phenomenon that absence of repressor or high levels of allolactose assures the operator cannot bind repressor in lac operon of the bacterium Escherichia coli, Kauffman applied this concept to BN modeling of gene regulatory networks (Kauffman, 1974). Canalizing functions are defined as follows. Definition 1. A Boolean function is canalizing if there exist a variable X, i ∈ {1, 2, ⋯, n} and a Boolean function g(X1, ⋯, X, X, ⋯, X) and , such that where a is called the canalizing value for the variable X and b is the canalizing output value (Kauffman, 1974). Based on Definition 1, this prior knowledge can be translated into imposing a specified value in the corresponding logical matrix. Assume that the logical matrix for the canalizing function (10) is denoted as S and the canalizing value a and canalizing output b can, respectively, be expressed in a vector form as and . Then, we can get the following result. Theorem 1. Given a canalizing function (10). The corresponding logical matrix where is the swap matrix. Proof. According to Lemma 1, it is easy to obtain . Applying (11), we have which corresponds to f(X1, ⋯, X, a, X, ⋯, X) = b for any X1, ⋯, X, X, ⋯, X ∈ {0, 1}. Let's take an example to illustrate the result of Theorem 1. Example 3. Consider the BCN (7). Assume that the Boolean function f1 is a canalizing function in x2 for a canalizing value and the corresponding canalizing output is . Due to the canalizing property, the logical matrix S1 can be reduced to It can be checked that , no matter whether or . Note that the logical matrix S1 contains only two binary parameters (i.e., θ1 and θ3). It shows that using canalizing property can reduce the number of binary parameters. As an important subclass of canalizing function, k-canalizing function is defined as follows. Definition 2. Let σ be a permutation on the set {1, 2, ⋯, n}. A Boolean function is k-canalizing in the variable order Xσ(1), Xσ(2), ⋯, Xσ( with canalizing input values a1, a2, ⋯, a and canalizing output values b1, b2, ⋯, b, if it can be represented in the form (Kauffman et al., 2003). Note that if all variables have certain canalizing values, then the function is called nested canalizing function (Kauffman et al., 2003). As a Boolean variable can only take two values, i.e., {0, 1}, (12) can be equivalently expressed as f(X1, ⋯, X) = b, if Xσ(1) = 1−a1, Xσ(2) = 1−a2, ⋯, Xσ( = a, i = 1, 2, ⋯, k. Using the Boolean variables to represent a multi-valued logic variable, it is straightforward to recognize that a k-canalizing function can be equivalently formulated as a canalizing function in a multi-valued logic variable. Therefore, Theorem 1 can be applied to specify the logical matrix for k-canalizing or nested canalizing function (12). It is necessary to point out that different from the approaches proposed in Breindl et al. (2013) and Faisal et al. (2010), some binary parameters can be directly inferred, no matter which canalizing value the canalizing variable takes. Example 4. Consider the BCN (7). Assume that the Boolean function f2 is nested canalizing function, which can be represented as Because f2(1, X1) = 1 for X1 ∈ {0, 1}, we have Moreover, due to f2(0, 1) = 0, there is Remark 1. Theorem 1 implies that considering canalizing property of a Boolean function, the corresponding logical matrix can be expressed with fewer binary parameters. For instance, if a Boolean function .

4.3. Unate boolean functions

The behavior of some substances or genes are well studied and it is known that they act as suppressing or activating in all reactions they are involved. If they always act inhibiting they have the so called negative unateness property. For the case that a quantity exclusively induces the expression of another gene or substance it has the positive unateness property (Porreca et al., 2010). For the mathematical modeling of the unatess properties let us consider another important type of Boolean functions, which is called the unate function (Breindl et al., 2013). Definition 3. (Breindl et al., 2013) A Boolean function is unate in x, if for any it holds for positive unateness that or it always holds for negative unateness that In the same way as Breindl et al. (2013), unateness can be equivalently represented as linear formulation. Afterwards, this linear formulation can be seen as additional inequality constraints in the optimization problem. As Boolean function can be rewritten as a vector form (4) and according to Lemma 1, there is where S is the logical matrix corresponding to the Boolean function f. Hence, f(⋯, X, 0, X, ⋯) and f(⋯, X, 1, X, ⋯) can, respectively, be represented in a vector form as and Furthermore, based on the vector form of Boolean variable (3) and according to (13) or (14), for each x1, x2, ⋯, x, x, ⋯, x ∈ Δ2 an inequality can be set up. Putting all inequality constraints together, we can find a matrix A for the following expression. Example 5. Consider the Boolean function x1 = f1(x2), this function f1 is defined by two unknown parameters θ1 and θ2. Assume that the Boolean function f1 is a unate function with respect to x2, which satisfies (13). As the first step, the matrix and are calculated, which yields Then, the inequality constraint is

4.4. Known attractors or limit cycles

When the BCN is not perturbed for a sufficiently long time it reaches the steady state. The steady state of a BCN can be exactly one state (i.e., attractor) or a fix cycle of some states (i.e., limit cycle). Attractors or limit cycles are assumed to determine the phenotype in the cell differentiation (Huang and Ingber, 2000). The experimental setup to measure the steady state of a system is simpler and measurements are easier to reproduce compared with transient dynamics. As a result, the steady state of the BN is often already known when the perturbation experiments for identification of the transient behavior are carried out. This knowledge can be utilized as follows. An attractor corresponds to a self loop in the reachability graph. For a given input combination this fixes one specific coulumn in the matrix L. For the constant input and the constant state the k-th column is known to be with k = (i − 1)2 + j. A limit cycle can be analyzed in a similar manner. For the given state sequence of the limit cycle of length T and the constant input one can calculate T columns of L. For each time instant t of the cycle the actual state and the next state is known. The information of this known transition is used by setting the k-th column to with k = (i−1)2 + j.

5. Identification approach

In this part, the identification problem of BCNs will be studied. At first, it will be shown that the identification problem can be reformulated as a nonlinear pseudo-Boolean optimization problem by applying the idea of the prediction error method. The pseudo-Boolean optimization can be transformed into an equivalent linear binary integer programming problem that can be solved more efficiently. Then, we give a way to deal with missing measurement values. Finally, we discuss how dependencies between measured substances can be handled.

5.1. Optimization problem

The prediction error method (PEM) is one of the most widely used identification methods (Isermann and Münchhof, 2011). The basic idea behind this method is to choose parameters to make the difference between a prediction based on the model and the measured values as small as possible. As the PEM minimizes the prediction error in the identified system, errors in the data set due to noise need no special treatment. Obviously the more noise is expected in the data set the more data should be acquired for identification of a reliable model. Before applying PEM, it is necessary to specify a measure of prediction error. In information theory, the Hamming distance d(X, Y) between two vectors is defined as the number of positions, in which the entries differ (Hamming, 1950). As each entry in the vectors X and Y belongs to the Boolean domain {0, 1}, (19) can be equivalently written as Furthermore, let x, y be, respectively, the vector form of [X] and [Y]. Then, it is straightforward to get Based on this, the Hamming distance d(X, Y) can be rewritten as Assume that the observed input and state data is {(U(t), X(t)), t = 0, 1, ⋯, T}. The vector form of the input data {U1(t), U2(t), ⋯, U(t)} and state data {X1(t), X2(t), ⋯, X(t)} are denoted, respectively, as u1(t), u2(t), ⋯, u(t) and x1(t), x2(t), ⋯, x(t). Since the logical matrix S for the state variable X can be represented by the parameter vector θ, we simply denote them as S(θ). Suppose that the state variable X can be influenced by the variables X, X, ⋯, X. According to (5), it is easy to get expression of the prediction : Recalling (21) and (22), the PEM method will estimate the binary parameters by minimizing the prediction error, i.e., Furthermore, the optimization problem (24) can be equivalently rewritten as which is actually equivalent to Next, it will be shown that the optimization problem (25) can be formulated as a pseudo-Boolean optimization (i.e., optimization of pseudo-Boolean functions). A pseudo-Boolean function is a mapping from a finite number of Boolean variables to a real number and can be uniquely represented by a multi-linear polynomial (Boros and Hammer, 2002). As mentioned before, any logical matrix can be expressed by a multi-linear polynomial. After calculation, the term can be represented by a multivariate polynomial. where for and the factor is a natural number. In addition, using the property of Boolean variables , the multivariate polynomial (26) is easily transformed into a multi-linear polynomial. Consequently, the term can be described by a multi-linear polynomial (6) and the optimization problem (25) is transformed into a pseudo-Boolean optimization problem So far, several different ways to handle the nonlinear pseudo-Boolean optimization problems (27) exist, such as reduction to an equivalent linear or quadratic binary programming problem, branch-and-bound method, linear approximations (Boros and Hammer, 2002; Crama and Rodrí-guez-Heck, 2017). As the linear programming relaxation of an integer linear program can be solved efficiently and based on the solution integer solutions can be found, in this paper we consider “linearization”, so that nonlinear binary optimization can be reduced to integer linear program (Crama and Rodrí-guez-Heck, 2017). The key is to introduce auxiliary Boolean variables to replace the nonlinear monomial in (6) by means of the AND-expression . Simultaneously to satisfy the AND-expression, linear inequalities as constraints are considered to get feasible value of the nonlinear monomial . Finally, an optimization problem equivalent to (27) is obtained as The constraints in the optimization problem in (27) can be complemented by additional constraints representing the prior knowledge of alternative hypotheses or unateness as shown in Section 4.1 and Section 4.3, respectively. Remark 2. It is important to note that minimizing or maximizing a pseudo-Boolean function is known to be .

5.2. Handling of large scale networks

With modern measurement techniques it is possible to quantify a huge amount of substances simultaneously. A Boolean network which describes the observed interactions is then also of large scale. But the number of substances which are direct relevant for the regulation of certain substance is usually limited, in other words the connectivity inside the network is bounded. For instance, as pointed out by Arnone and Davidson (1997), the connectivity is bounded by 8. Without prior knowledge the complexity of the algorithm is as all state and input combinations have to be considered as potential regulators for all states, even though only some of them are relevant in the end. This would limit the applicability of the approach to rather small networks. If one has hypotheses about potential interaction partners and the number of potential regulators per state is limited by a set of k variables, then the complexity of the algorithm is , as the regulative functions for each state can be inferred separately. The hypotheses for the interaction partners are not necessarily based on prior-knowledge, but could also be computed based on the data set. In Margolin et al. (2006) an approach is presented, which is based on the information theoretic concept of mutual information ranking and the restriction to pairwise interactions that leads to a very good scaling with big data sets.

5.3. Handling of missing measurement values

Dependent on the measurement technique it is sometimes not possible to measure all states at all time instances and the missing values must be handled in the data analysis. There are approaches in the literature to compute an imputation e.g., for microarrays in Gan et al. (2006) and gel-based proteomics in Albrecht et al. (2010). These approaches are based on interpolation or heuristics. An alternative is to use a data analysis approach which can deal with incomplete data matrices. A missing measurement value can be estimated during the identification by adding additional binary parameters in the identification process. Because of vector expression of states, all possible states belong to the set . In this way, n binary parameters are enough for vector expression of a completely unknown state at time k. For example, if n = 2, then we can generally express the unknown state as Furthermore, as the states of the system are known partially, then the number of binary parameters can be reduced accordingly. So for each missing value one parameter is added to the optimization and the imputation for this value is calculated which fits best to the other dynamic behavior of the system.

5.4. Handling of unmeasurable processes

In some systems post transcriptional protein-protein interactions induce dependencies between the measured abundances similar to the transcriptional regulation. This leads to the situation that the transcriptional regulation can not be observed directly and the identification procedure needs to be adapted accordingly (Geier et al., 2007). The dependencies between the states and the measured outputs can be included in boolean models easily by adding Boolean functions mapping from the actual stats X(t) to the measured outputs Y(t): where is the output vector at time t, h are logic functions. All structural information on the logic functions can be expressed with a logical matrix H which can be derived analogous to Equations (2–5). All approaches presented in this paper can be extended for the BN model with output mapping. As additional logic functions are to be identified, additional unknown parameters are added and these parameters cannot be separately identified from the parameters of the regulative functions, which impacts the computational burden drastically (Zhang et al., 2017b).

5.5. Influence of noise

In real world experiments measurement noise is unavoidable. With a sophisticated binarization method the influence of additive noise can often be suppressed (Hopfensitz et al., 2012). But noise can still lead to wrong binarized values in some cases and consequently errors in the input to the identification method cannot be totally avoided. As the presented approach is based on an optimization, the network which optimally fits to the observed data is found. Inconsistent transitions caused by noise in the data set can be handled directly and lead to an identification result with a non-zero prediction error. If, due to noise, the observed transitions would lead to an identification result which is contradictory to prior knowledge, the identification approach ignores these transitions directly. Example 6. Consider the BCN for oxidative stress response pathways with the PI3-Kinase-Akt pathway given in Sridharan et al. (2012). In the model, X1 represents stress reactive intermediaries, X2 transcription factor A, X3 key protein, X4 protein kinase, X5 transcription factor B, X6 anti-stress response element, U stress signal. Using STP, (32) can be converted into the algebraic form (5) with . Assume that two experiments have been executed starting in steady state with two different stimuli, the corresponding input-state data is obtained as shown in Figures 2A,B. Assume further that as prior knowledge the candidates of regulative interactions (see the dashed lines in Figure 3A) and the attractor are given. The attractor of the BCN without stress is X1 = 0, X2 = 1, X3 = 1, X4 = 0, X5 = 0, X6 = 0.

Figure 2

Perturbation and state measurement. (A) First experiment. (B) Second experiment.

Figure 3

Hypothesis, partially identified and fully identified network graph. (A) Hypotheses for regulative interactions. (B) Identified Boolean network without canalizing information. (C) Identified Boolean network with prior knowlege.

Perturbation and state measurement. (A) First experiment. (B) Second experiment. Hypothesis, partially identified and fully identified network graph. (A) Hypotheses for regulative interactions. (B) Identified Boolean network without canalizing information. (C) Identified Boolean network with prior knowlege. Based on the candidates of regulative interactions, the number of unknown binary parameters θ representing the logical matrices of the Boolean functions can be reduced from 6·27 = 768 to 40 as described in Section 4.1. For instance, since the variable X2 is connected with the variables X1, X3 and X5, it means that the Boolean function of the variable X2 can be described by f2(X1, X5, X5). Accordingly, 8 binary parameters are enough to represent the logical matrix S2 of the Boolean function f2, i.e., The information about the steady state is used as described in Section 4.4 to determine one parameter in each matrix, which reduces the number of unknown variables to 34. In the next, we apply the proposed approach to identify the model of the BCN from the given input-state data. Solving the optimization problem (28), in total, 31 unknown binary parameters can be determined. The identification result is depicted in Figure 3B and the identified matrices are as follows, It can be seen that the logical matrices of the Boolean functions for X5 and X6 can not be uniquely determined. Combined with an additional information about activating or suppressing properties of the states, for instance, X4 and X5 are, respectively, activator to X5 and X6, the complete model can be uniquely reconstructed. The canalizing property of X4 and X5 can be utilized as described in Section 4.2. If this information is not available, one could conduct additional experiments with different stimuli and combine the data to have full reconstruction of the model as depicted in Figure 3C.

6. Discussion

The proposed method facilitates the incorporation of various types of prior knowledge. The optimization problem can be solved by efficient linear programming solvers. By using the simplex method one can guarantee to find the network which optimally fits to the observed data. In comparison, the genetic algorithms based approaches may not guarantee the optimal solution. The proposed method is developed for synchronous Boolean networks. It can be applied to large scale networks, if the connectivity of the network to be identified is limited with aid of prior knowledge or application of information theory. In future we plan to investigate data-based approaches to infer the connections in large networks and automated partitioning into smaller subsystems (e.g., with an adapted approach from discrete event systems like Saives et al., 2018). We also work on a new method for the binarization based on the idea that the qualitative system behavior before and after the binarization shall be the same.

Author contributions

TL, ZZ, and PZ conception and design of research. TL and ZZ performed simulation and analyzed data. TL, ZZ, and PZ interpreted simulation results. TL and ZZ prepared figures. TL, ZZ, and PZ drafted manuscript and approved final version of manuscript.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

31 in total

1. Identification of genetic networks from a small number of gene expression patterns under the Boolean network model.

Authors: T Akutsu; S Miyano; S Kuhara
Journal: Pac Symp Biocomput Date: 1999

Review 2. Shape-dependent control of cell growth, differentiation, and apoptosis: switching between attractors in cell regulatory networks.

Authors: S Huang; D E Ingber
Journal: Exp Cell Res Date: 2000-11-25 Impact factor: 3.905

3. Random Boolean network models and the yeast transcriptional network.

Authors: Stuart Kauffman; Carsten Peterson; Björn Samuelsson; Carl Troein
Journal: Proc Natl Acad Sci U S A Date: 2003-12-01 Impact factor: 11.205

4. Constructing logical models of gene regulatory networks by integrating transcription factor-DNA interactions with expression data: an entropy-based approach.

Authors: Guy Karlebach; Ron Shamir
Journal: J Comput Biol Date: 2012-01 Impact factor: 1.479