MOTIVATION: Major disorders, such as leukemia, have been shown to alter the transcription of genes. Understanding how gene regulation is affected by such aberrations is of utmost importance. One promising strategy toward this objective is to compute whether signals can reach the transcription factors through the transcription regulatory network (TRN). Due to the uncertainty of the regulatory interactions, this is a #P-complete problem, and thus solving it for very large TRNs remains a challenge.
RESULTS: We develop a novel and scalable method to compute the probability that a signal originating at any given set of source genes can arrive at any given set of target genes (i.e., transcription factors) when the topology of the underlying signaling network is uncertain. Our method tackles this problem for large networks while providing a provably accurate result. Our method follows a divide-and-conquer strategy. We break down the given network into a sequence of non-overlapping subnetworks such that reachability can be computed autonomously and sequentially on each subnetwork. We represent each interaction using a small polynomial. The product of these polynomials expresses the different scenarios in which a signal can or cannot reach the target genes from the source genes. We introduce polynomial collapsing operators for each subnetwork. These operators dramatically reduce the size of the resulting polynomial and thus the computational complexity. We show that our method scales to entire human regulatory networks in only seconds, while the existing methods fail beyond a few tens of genes and interactions. We demonstrate that our method can successfully characterize key reachability characteristics of the entire transcription regulatory networks of patients affected by eight different subtypes of leukemia, as well as those from healthy control samples.
AVAILABILITY: All the datasets and code used in this article are available at bioinformatics.cise.ufl.edu/PReach/scalable.htm.
Major disorders, such as cancer, have been shown to alter the transcription of a large number of genes and thus affect the mechanism that governs cell functions (Krivtsov, 2009; Valk ). Many complex disorders, such as acute lymphoblastic leukemias, however, yield a varying spectrum of expression profiles and, as a result, cannot be robustly characterized by merely studying the gene expressions (Armstrong, 2002).

An important part of cell biology research is the study of the causal relationship between extracellular conditions and the cell response. Such causality is governed by a chain of biochemical reactions through which extracellular signals are transmitted from membrane receptors to transcription factors (i.e., reporters) via protein–protein interactions (Bu and Callaway, 2011). While the pattern of this mechanism is similar for all organisms, important variations in its quantitative aspects, such as gene expressions, result from external perturbations, the differentiation stage of the cell, the timing of DNA replication and various epigenetic mutations (Los ; Mattick ). Therefore, detecting these quantitative variations is an important source of information for assessing the fitness of the organism and ultimately for diagnosis and prognosis.

Extensive evidence suggests that there is a degree of uncertainty in our knowledge of interactions within cells (Bader ; Ceol ; Deng ; Ourfali ; Sharan ; Suthram ; Szklarczyk ; von Mering ). The source of this uncertainty is 2-fold. First, the biological processes that are modeled as protein interactions in biological networks are stochastic events (Bader ). Second, the evidence in support of an interaction is not entirely decisive for the actual presence of the interaction (Bader ; Ourfali ; Sharan ; Shlomi ) due to many reasons, such as epigenetic variations across different cells (Gerstein ).
Several schemes have already been proposed to assess the reliability of protein interactions in the form of confidence values (Bader ; Deng ; Suthram ). Such interaction confidence values are now available in large biological network databases, such as MINT (Ceol ) and STRING (Szklarczyk ).

Recent studies often model the uncertainty of the interactions in biological networks using probabilistic networks (Gabr ; Todor ; Todor ). We adopt the same model in this article, namely, each node of the network denotes a gene and the directed edge from a node $v_i$ to a node $v_j$ denotes that the gene corresponding to $v_i$ can regulate the gene denoted by $v_j$ through an interaction. Each edge in this network is a probabilistic event. That is, it is considered possible, but not certain, reflecting our uncertain knowledge of the gene regulation process. A common way to model the uncertainty of each edge is to associate it with a probability value, which is computed for each interaction from several factors: gene expressions, available evidence for it and network topology around it (Sharan ).

The ability to compute confidence values for interactions provides opportunities to model and study biological networks accurately. It, however, comes at a high price, as the uncertainty of the topology of interactions makes studying biological networks a computationally challenging task. The challenge is that a probabilistic network represents a large number of alternative deterministic network topologies. More precisely, a network with n probabilistic edges yields $2^n$ possible network configurations, as each one of the n edges may be present or absent. For instance, in Figure 1, the probabilistic network shown on top corresponds to 16 deterministic networks since it contains 4 probabilistic edges.
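To make the $2^n$ blow-up concrete, the following minimal sketch enumerates every deterministic configuration of a small probabilistic network and sums the probabilities of those in which the target is reachable. It is only a brute-force baseline, not the method developed in this article. The four-edge network, its node names and its edge probabilities are hypothetical (they are not the network of Figure 1), and the function names are assumptions made for illustration.

```python
from itertools import product

def reachable(edges, source, target):
    """Depth-first search over a list of directed (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def brute_force_reachability(prob_edges, source, target):
    """Sum the probabilities of all 2^n deterministic networks in which
    the target is reachable from the source (edges are independent)."""
    edges = list(prob_edges)
    total, configurations = 0.0, 0
    for mask in product([True, False], repeat=len(edges)):
        configurations += 1
        weight, present = 1.0, []
        for edge, keep in zip(edges, mask):
            p = prob_edges[edge]
            weight *= p if keep else 1.0 - p
            if keep:
                present.append(edge)
        if reachable(present, source, target):
            total += weight
    return total, configurations

# Hypothetical network: two parallel two-edge routes from "a" to "d".
example = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "d"): 0.5, ("c", "d"): 0.5}
prob, count = brute_force_reachability(example, "a", "d")
print(count, prob)
```

With four edges the loop visits 16 configurations; every additional edge doubles that number, which is why exhaustive enumeration becomes impractical beyond a few tens of edges.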
Fig. 1.
A probabilistic network (top) and two of the deterministic networks corresponding to it (bottom). Each of the deterministic networks is obtained from the probabilistic network with some probability determined by the probabilities of the edges that are included or not in the deterministic network. $p_i$ denotes the probability of edge $e_i$ being present. $q_i = 1 - p_i$ is the probability of the edge being absent. The expression above each deterministic network is the probability of observing it.
In this article, we address the problem of characterizing the signaling reachability in transcription regulatory networks (TRNs). Unlike most of the existing literature, we eliminate the limitations of the classical assumption that all interactions are deterministic and adopt the more descriptive probabilistic network. More specifically, given a set of source genes $s_1, s_2, \ldots$ and a set of target genes $t_1, t_2, \ldots$, we compute the reachability profile of that network as a doubly indexed vector R where, for every source index i and target index j, the entry R[i,j] is the probability that a signal originating at $s_i$ can reach $t_j$ (i.e., $s_i$ regulates $t_j$). We show that the reachability profile can help us understand how different disorders alter the cellular functions based on the signaling patterns of the gene regulatory networks. We particularly focus on leukemias, which are challenging due to the heterogeneity of their transcription patterns.

Summary of related work. The problem of computing reachability in uncertain network topologies has drawn significant attention in the context of network reliability. Various exact methods, as well as approximate methods, have been proposed. We refer interested readers to several surveys on the topic (Aggarwal ; Hwang ). Theoretical results on the complexity of the problem reveal that it is #P-complete (Brown and Colbourn, 1996; Husfeldt and Taslaman, 2010; Provan and Ball, 1983). The problem is significantly simplified in the case of acyclic graphs.
This type of graph can be represented as a Bayesian network, for which various inference algorithms exist. However, for this simple case sophisticated inference algorithms are unnecessary. In the context of biological networks, the problem for general graphs was first addressed by Ourfali . The goal of these authors was to infer the structure of the signaling network that best explains a set of gene knockout pairs, given a protein–protein interaction network. To achieve this goal, they developed a method to compute the reachability probability for each knockout gene pair. Their method is an exact solution based on the inclusion–exclusion principle (van Lint and Wilson, 1992). However, due to its high time complexity, this method works only for very small networks (i.e., those with a few tens of nodes). PReach (Gabr ) computes the exact reachability probability based on polynomial multiplication. It is significantly faster than the inclusion–exclusion method of Ourfali for networks where there are many paths. However, it does not scale to large networks. Thus, the existing solutions cannot be used to study entire TRNs, and there is a great need for accurate yet efficient methods.

Contributions. Here, we develop a novel method that computes the probability that a signal originating at a given source gene can reach a given target gene in a given probabilistic network. Unlike existing methods, our solution is both exact (i.e., it computes this probability without error) and scalable to large networks. Our method follows a divide-and-conquer approach. We partition the given probabilistic network into a sequence of loosely connected clusters of nodes. On the boundary between two consecutive clusters lies a set of nodes called a node separator. Any signal that originates from the source node and arrives at any node in the latter cluster must visit the node separator.
Similar to PReach (Gabr ), we model the given probabilistic network using polynomials. The form of the polynomials of our method, however, differs from that of PReach in a way that allows us to collapse the polynomial to a very small size that is determined by the size (number of interactions) of the clusters and the number of nodes in a given boundary. Each term in our polynomial evaluates the existence probability of a collection of subsets of interactions. In brief, instead of computing the reachability probability from the given source node to the target node directly, we incrementally compute the reachability probability from the source node to each node separator in sequential order. That allows us to avoid storing a massive fraction of the terms of the polynomial (i.e., the terms corresponding to the nodes in earlier clusters). Our experimental results on real and synthetic datasets demonstrate that our method scales to very large network sizes while the inclusion–exclusion method (Ourfali ) and PReach (Gabr ) fail. We also observe that the reachability profiles provide a valuable resource for characterizing leukemias and differentiating the centrality of the genes across different leukemias as well as healthy control groups.

In summary, the key contributions of this work are:
(i) We introduce a new quantity for evaluating the state of a biological network, the reachability profile.
(ii) We introduce a novel, fast and scalable method to compute the reachability profile of large networks, based on polynomials and polynomial collapsing operators.
(iii) We demonstrate the usefulness of reachability profiles in a detailed analysis of different types of leukemias.

The rest of the article is organized as follows. Section 2 describes our method. Section 3 presents our experimental results. Section 4 concludes with a brief discussion.
2 METHOD
In this section we present our method in detail. We first define the essential theoretical concepts in Section 2.1. We then present an overview of our method in Section 2.2. We discuss how to compute intermediate reachability probabilities in Section 2.3. We elaborate on how to partition the network in Section 2.4.
2.1 Preliminary definitions
We start by formally defining the probabilistic network concept. We provide a list of notations used throughout the article in the Supplementary Material.
Definition 1 (probabilistic network)
A probabilistic network is a graph $\mathcal{G} = (V, E, P)$, where V is the set of nodes, E is the set of edges and P is a function that assigns a probability to each edge in E.

In our context, each node in V represents a gene, each edge in E represents an interaction between two genes and P associates to each edge the probability of the existence of the interaction it represents. For instance, in Figure 1 (top figure), V = {a,b,c,d} and E = {e1,e2,e3,e4}. We assume that each edge exists independently of all other edges. This assumption is commonly used in the literature for similar problems (Ceol ; Szklarczyk ). We limit our description to directed networks, although undirected networks can be dealt with by replacing each undirected edge with two edges in opposite directions.

Given a probabilistic network $\mathcal{G} = (V, E, P)$, we call the deterministic network G = (V,E) the maximal deterministic network of $\mathcal{G}$. In other words, the maximal deterministic network is the deterministic network in which all possible interactions of $\mathcal{G}$ are present.

The computational problem we address in this article is: given a probabilistic network $\mathcal{G} = (V, E, P)$, a source node $s \in V$ and a target node $t \in V$, what is the probability that t can be reached from s?

Next, we define a graph concept, the node separator, which will help us explain our method.
Definition 2 (Node Separator)
Let G = (V,E) be a deterministic network and $s, t \in V$ be two of its nodes. An s-t node separator of G is a set of nodes $K \subseteq V$ whose removal disconnects t from s in G.

Figure 2 illustrates this concept. Here, the source node s and the target node t are labeled with 1 and 8, respectively. The set of nodes {4, 5} is an s-t node separator. We say that a node separator is minimal if none of its proper subsets is a node separator.
Fig. 2.
A network with an s-t node separator. The source is node 1 and the target is node 8. The dotted rectangle indicates an s-t node separator
A node separator partitions the nodes of the network into three disjoint subsets:
(i) The left nodes are the nodes that are reachable from the source, but the target cannot be reached from any of them without going through the node separator (e.g., nodes 1, 2 and 3 in Fig. 2, for the node separator {4, 5}).
(ii) The node separator itself (e.g., nodes 4 and 5 in Fig. 2).
(iii) The right nodes are the remaining nodes (e.g., nodes 6, 7 and 8 in Fig. 2). Notice that these are the nodes from which the target can be reached, but they are not reachable from the source without passing through the node separator.

A node separator K also partitions the edges of the given network into three subsets:
(i) The left edges are the edges between the nodes in the union of left and separator nodes (e.g., edges e1, e2, e3, e5 and e6 in Fig. 2). We denote the set of left edges with L(K).
(ii) The right edges are the edges between right nodes or from a separator node to a right node (e.g., edges e7, e9, e10 and e11 in Fig. 2).
(iii) The backward edges are the edges from right nodes to the separator nodes or from right nodes to left nodes (edges e4 and e8 in Fig. 2).
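The node and edge partitions above can be illustrated with a short sketch. It assumes a deterministic network given as a list of directed edges; the function names and the example edge list are hypothetical (the node labels mimic the left/separator/right split described for Figure 2, but the edges are not those of the figure).

```python
def reach_from(adjacency, start, blocked=frozenset()):
    """Nodes reachable from `start` without entering any node in `blocked`."""
    stack = [n for n in start if n not in blocked]
    seen = set(stack)
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def split_by_separator(edges, s, t, K):
    """Return (left nodes, right nodes, backward edges) for an s-t separator K."""
    adjacency, nodes = {}, set()
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        nodes.update((u, v))
    K = frozenset(K)
    left = reach_from(adjacency, [s], blocked=K)   # reachable while avoiding K
    if t in left:
        raise ValueError("K is not an s-t node separator")
    right = nodes - left - K                       # the remaining nodes
    # Backward edges leave a right node and enter a separator or left node.
    backward = [(u, v) for u, v in edges if u in right and v not in right]
    return left, right, backward

# Hypothetical network with source 1, target 8 and separator {4, 5}.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (3, 5),
         (4, 6), (5, 6), (5, 7), (6, 8), (7, 8)]
left, right, backward = split_by_separator(edges, 1, 8, {4, 5})
# Adding an edge from a right node back into the separator creates a backward edge.
_, _, backward_with_extra = split_by_separator(edges + [(6, 4)], 1, 8, {4, 5})
print(left, right, backward, backward_with_extra)
```

A separator for which the returned list of backward edges is empty corresponds to what the text below calls a good node separator.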
Theorem 1
Let G = (V,E) be a deterministic network. Given two nodes $s, t \in V$, let K be an s-t node separator. For any right node u of K, K is also an s-u node separator.

We prove Theorem 1 in the Supplementary Material.

If a node separator does not yield backward edges, we call it a good node separator. We only use good node separators in the rest of the article. So, in what follows, by node separator we refer to a good node separator, unless otherwise specified. Finally, we define the concept of subset reachability in probabilistic networks.
Definition 3 (Subset Reachability)
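Let $\mathcal{G} = (V, E, P)$ be a probabilistic network. Let s and t ($s, t \in V$) be source and target nodes in $\mathcal{G}$. Consider two s-t node separators $K_1$ and $K_2$ such that every node $u \in K_2$ is a right node of $K_1$. Let $S \subseteq K_1$ and $T \subseteq K_2$. We say that $K_2$ is T-reachable from S if every node in T is reachable from at least one node in S and no node in $K_2 - T$ is reachable from any node in S. We denote the probability that $K_2$ is T-reachable from S with $p(S, T, K_2)$.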
2.2 Overview of the method
Our method works in two steps.
Given a probabilistic network $\mathcal{G}$ and source and target nodes s and t, in the first step, we partition $\mathcal{G}$ into a sequence of subnetworks that are connected to each other through node separators. In general terms, let us denote the sequence of node separators with $K_0, K_1, \ldots, K_{c+1}$, where $K_0 = \{s\}$ and $K_{c+1} = \{t\}$. We choose these node separators such that, for all i, every node $u \in K_{i+1}$ is a right node of $K_i$. Following from Definition 3, the problem we solve in this article is equivalent to computing $p(\{s\}, \{t\}, \{t\})$.

In the second step, we compute the reachability probability from s to t. More specifically, using the notation above, for any i ($0 < i \le c$) and any subset $T \subseteq K_{i+1}$, we write the probability $p(\{s\}, T, K_{i+1})$ as

$$p(\{s\}, T, K_{i+1}) = \sum_{S \subseteq K_i} p(\{s\}, S, K_i)\, p(S, T, K_{i+1}) \qquad (1)$$

The correctness of Equation (1) follows from the definition of node separator and Theorem 1. More specifically, in order to reach any node in $T \subseteq K_{i+1}$, we have to visit at least one node in $K_i$. The product $p(\{s\}, S, K_i)\, p(S, T, K_{i+1})$ in Equation (1) is the probability that a signal reaches T by visiting all the nodes in S and no other node in $K_i - S$. The summation in this equation enumerates all possible subsets $S \subseteq K_i$. Thus, it accumulates the probability of all possible alternative routes from s to T defined by all possible subsets S.

The case i = 0 is a special one. Since $K_0$ contains only s, we have T = {s}. Thus the probability to reach the source node is 1. Following from Equation (1), our algorithm iteratively computes $p(K_0, T, K_{i+1})$ for the subsets $T \subseteq K_{i+1}$ by moving from one node separator to the next, starting from $K_0$.

Figure 3 illustrates our method. In this example, the set of edges in E is split into three non-overlapping sets using four node separators $K_0$, $K_1$, $K_2$ and $K_3$, where $K_0 = \{s\}$ and $K_3 = \{t\}$. These sets are $E_1 = L(K_1)$, $E_2 = L(K_2) - L(K_1)$ and $E_3 = L(K_3) - L(K_2)$. Each of these sets defines a subnetwork of $\mathcal{G}$. Once the network is partitioned this way, instead of computing the reachability probability directly from s to t, we compute it incrementally by advancing from one node separator to the next. For the example in Figure 3, we first consider the separator $K_1$, then $K_2$ and finally $K_3$.
At each node separator, we only consider the subnetwork which contains the left edges of that separator.
Fig. 3.
A hypothetical network with two disjoint s-t node separators, K1 = {2,3} and K2 = {4,5,6}. Source and target nodes are labeled with s = 1 and t = 7. For uniformity, we consider K0 = {1} and K3 = {7} also to be node separators
To understand Equation (1) better, consider the separators K1 and K2 in Figure 3. There are three possible scenarios in which a signal from s = 1 reaches a subset T of K2, say T = {4}. Each of these scenarios corresponds to a nonempty subset of K1 = {2, 3}.
(i) Visit S = {2} and do not visit {3}. This happens with probability p({1},{2},K1)p({2},{4},K2).
(ii) Visit S = {3} and do not visit {2}. This happens with probability p({1},{3},K1)p({3},{4},K2).
(iii) Visit both nodes in S = {2, 3}. This happens with probability p({1},{2,3},K1)p({2,3},{4},K2).

The sum of the three probabilities above yields the probability p({1},{4},K2).

In Section 2.3, we explain how we compute Equation (1) efficiently for i > 0 (Step 2). In Section 2.4, we explain how we choose the node separators (Step 1).
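The separator-by-separator recurrence of Equation (1) can be checked numerically on a toy network. The sketch below is an illustration under stated assumptions, not the polynomial method of Section 2.3: each factor p(S, T, K) is obtained by brute-force enumeration of the edges between two consecutive separators, and the chained result is compared against brute force over the whole network. The edge lists and probabilities are hypothetical (only the separators K1 = {2, 3} and K2 = {4, 5, 6} follow Figure 3), and the helper names are invented for this example.

```python
from itertools import product

def subset_distribution(prob_edges, start_set, separator):
    """Brute-force p(S, T, K): probability that exactly the subset T of the
    separator K is reached from start_set S, for every T."""
    edges = list(prob_edges)
    dist = {}
    for mask in product([True, False], repeat=len(edges)):
        weight, adjacency = 1.0, {}
        for (u, v), keep in zip(edges, mask):
            p = prob_edges[(u, v)]
            weight *= p if keep else 1.0 - p
            if keep:
                adjacency.setdefault(u, []).append(v)
        seen, stack = set(start_set), list(start_set)
        while stack:
            for nxt in adjacency.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        T = frozenset(seen & set(separator))
        dist[T] = dist.get(T, 0.0) + weight
    return dist

def chain(layers, separators, source):
    """Apply Equation (1) from one separator to the next, starting at K0 = {s}."""
    current = {frozenset([source]): 1.0}           # p({s}, {s}, K0) = 1
    for layer_edges, K in zip(layers, separators):
        following = {}
        for S, p_S in current.items():
            for T, p_T in subset_distribution(layer_edges, S, K).items():
                following[T] = following.get(T, 0.0) + p_S * p_T
        current = following
    return current

# Hypothetical edge sets E1, E2, E3 between consecutive separators.
E1 = {(1, 2): 0.9, (1, 3): 0.6}
E2 = {(2, 4): 0.7, (3, 4): 0.5, (3, 5): 0.8, (3, 6): 0.4}
E3 = {(4, 7): 0.6, (5, 7): 0.3, (6, 7): 0.9}
chained = chain([E1, E2, E3], [{2, 3}, {4, 5, 6}, {7}], 1).get(frozenset([7]), 0.0)
direct = subset_distribution({**E1, **E2, **E3}, {1}, {7})[frozenset([7])]
print(chained, direct)
```

The chained computation enumerates at most 2^4 configurations per layer, whereas the direct computation enumerates 2^9. The two results agree because the edge sets between consecutive separators are disjoint and the edges are independent.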
2.3 Computing the reachability probability
In Equation (1), we presented an iterative formula to compute the reachability probability $p(\{s\},\{t\},\{t\})$ by splitting the network using the node separators $K_0, K_1, \ldots, K_{c+1}$. Although this equation reduces the scale of the problem to the subnetworks between consecutive separators, computing it efficiently still remains a challenge. Here, we describe how we compute this probability efficiently yet provably correctly. More specifically, given two consecutive node separators $K_i$ and $K_{i+1}$ ($0 < i \le c$) and given $p(\{s\},S,K_i)$ for all subsets $S \subseteq K_i$, we discuss how we compute $p(\{s\},T,K_{i+1})$ for all subsets $T \subseteq K_{i+1}$.

From the definition of left edges, we know that the probability $p(\{s\},S,K_i)$ depends only on the edges in $L(K_i)$. This is because $L(K_i)$ contains all the edges that can lie on a path from s to any node in $K_i$. Let us denote the set of edges in $L(K_j) - L(K_{j-1})$ with $E_j$ for any 0 < j (i.e., left edges of $K_j$ which are also right edges of $K_{j-1}$). Thus, the probability $p(\{s\},T,K_{i+1})$ depends only on the edges in $E_{i+1}$, and can be computed by considering only these edges, when $p(\{s\},S,K_i)$ is known for all S. Below, we compute this probability by transforming the probabilistic network into a collection of polynomials.

Transformation into polynomial space. Assume that the given probabilistic network, $\mathcal{G}$, contains n edges and m nodes, denoted with $E = \{e_1, e_2, \ldots, e_n\}$ and $V = \{v_1, v_2, \ldots, v_m\}$, respectively. As the first step of the transformation, we associate to each edge a polynomial called the edge polynomial. More precisely, for edge $e_i \in E$, let $p_i = P(e_i)$ and $q_i = 1 - p_i$ denote the existence and absence probability of $e_i$, respectively. We define the edge polynomial of $e_i$ as the first degree polynomial of two variables, $x_i$ and $y_i$, $F_i(x_i, y_i) = p_i x_i + q_i y_i$.

Consider a subset E′ of the edges in E. We define the edge aggregation polynomial for E′, denoted with F(E′), as the product of all the edge polynomials associated with the edges in E′:

$$F(E') = \prod_{e_i \in E'} (p_i x_i + q_i y_i) = \sum_{E'' \subseteq E'} \Big(\prod_{e_i \in E''} p_i x_i\Big) \Big(\prod_{e_j \in E' - E''} q_j y_j\Big) \qquad (2)$$
Notice that each term in the summation above corresponds to one of the possible deterministic configurations for the network topology. The coefficient of a term in F(E′) is the probability of observing every edge whose x variable appears in the term (call this set $E''$) and not observing any of the remaining edges in $E' - E''$. To understand this better, consider the network in Figure 1 (network on the top). In the edge aggregation polynomial of this network, the term $x_3 x_4 y_1 y_2$ corresponds to the deterministic instance where only edges e3 and e4 are present (i.e., bottom left network in Fig. 1). The coefficient of this term is $q_1 q_2 p_3 p_4$, which is the probability of observing that network instance.

Reachability in polynomial space. As Equation (2) shows, the terms of the edge aggregation polynomial represent different deterministic network configurations. Thus, the probability $p(\{s\},T,K_{i+1})$ is equal to the sum of the coefficients of a specific subset of the polynomial terms: the terms which yield a topology where $K_{i+1}$ is T-reachable from {s}.

At this point, the polynomial transformation seemingly makes the reachability problem as complicated as exhaustively enumerating all network topologies. This is because (i) the edge aggregation polynomial has as many terms as the number of network topologies; and (ii) finding the subset of polynomial terms which yield the desired topologies will incur additional computational cost. Below, we build a novel algebra on the edge aggregation polynomial to compute this value by enumerating only a tiny fraction of the polynomial terms.

Algorithm 1 presents pseudocode that describes our algorithm for constructing the polynomial needed to compute $p(\{s\},T,K_{i+1})$. The algorithm takes the existing edge aggregation polynomial for the edges in $L(K_i)$ as input. At each iteration it grows that polynomial by aggregating it with the edge polynomial of a new edge in $E_{i+1}$ (Step 2). It then reduces the size of the resulting polynomial by collapsing it (Step 3).
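The edge aggregation polynomial described above can be sketched directly as a dictionary that maps each term to its coefficient. This is a minimal illustration without any collapsing. The representation (a term as a pair of index sets, one for the x variables and one for the y variables), the function names and the four edge probabilities are assumptions made for this example.

```python
def multiply_edge(poly, index, p):
    """Multiply a polynomial by the edge polynomial p*x_index + (1-p)*y_index.
    A term is keyed by (indices of x variables, indices of y variables)."""
    result = {}
    for (xs, ys), coeff in poly.items():
        result[(xs | {index}, ys)] = coeff * p          # edge present
        result[(xs, ys | {index})] = coeff * (1.0 - p)  # edge absent
    return result

def edge_aggregation_polynomial(probs):
    """Expand the product of the edge polynomials of all edges in `probs`."""
    poly = {(frozenset(), frozenset()): 1.0}
    for index, p in probs.items():
        poly = multiply_edge(poly, index, p)
    return poly

# Hypothetical existence probabilities p_1..p_4 for four edges.
probs = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6}
F = edge_aggregation_polynomial(probs)

# The term x3*x4*y1*y2: only edges e3 and e4 are present.
term = (frozenset({3, 4}), frozenset({1, 2}))
print(len(F), F[term])
```

The expanded polynomial has one term per deterministic configuration, so its size doubles with every aggregated edge, and the coefficients sum to 1 because they form a probability distribution over the configurations.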
Briefly, the collapsing step merges all terms which correspond to configurations in which $K_{i+1}$ is T-reachable from s, for each possible subset T of $K_{i+1}$, into a single term by replacing the variables in these terms with a single variable denoted with $z_T$. Thus, the coefficient of $z_T$ is the sum of the coefficients of the original terms that were collapsed. In the rest of this section, we elaborate on these steps, particularly the collapsing step.

Algorithm 1
Require: Probabilistic graph $\mathcal{G} = (V, E, P)$
Require: Node separators $K_i$ and $K_{i+1}$
Require: Edge aggregation polynomial F′ = F(L($K_i$))
1: for all $e_j \in E_{i+1}$ do
2:   Aggregate the edge polynomial of $e_j$ as F′ = F′ × $F_j(x_j, y_j)$
3:   Collapse F′
4: end for

We start by introducing some notation which will simplify our polynomial algebra below. For a subset of edges $E' \subseteq E$, we denote the set of indices of the edges in $E'$ by $\mathcal{I}(E')$. For instance, for $E' = \{e_2, e_5\}$, we have $\mathcal{I}(E') = \{2, 5\}$.

Let us denote the subset of edges of $E_{i+1}$ which have been multiplied into the edge aggregation polynomial so far with $E'_{i+1}$ and its set of indices with $I' = \mathcal{I}(E'_{i+1})$.

Following from Equation (2), since $E'_{i+1}$ and $L(K_i)$ are disjoint, we can write the edge aggregation polynomial of the edge set $E'_{i+1} \cup L(K_i)$ as $F(E'_{i+1})\, F(L(K_i))$. To simplify our notation of $F(E'_{i+1})$, for all $I \subseteq I'$, we denote $\prod_{j \in I} x_j$ and $\prod_{j \in I' - I} y_j$ with $x_I$ and $y_{I'-I}$, respectively. We denote the coefficient of $x_I y_{I'-I}$ with $\alpha_I$. Thus, we can write the first polynomial as $F(E'_{i+1}) = \sum_{I \subseteq I'} \alpha_I\, x_I\, y_{I'-I}$.

For each node separator $K_i$, we define a unique collapsing operator and denote it with $\rho_i()$. This is a linear operator; it acts on the terms of the given edge aggregation polynomial for the edges in $L(K_i)$ independently. Briefly, the collapsed polynomial contains a new variable, $z_S$, for each subset S of $K_i$. The form of this polynomial is $\sum_{S \subseteq K_i} \beta_S z_S$. In this representation, $z_S$ corresponds to the case where $K_i$ is S-reachable from $K_0$ (i.e., the original source node), and the coefficient $\beta_S$ is the probability of observing that case. In other words, $\beta_S$ is equal to $p(\{s\},S,K_i)$ in Equation (1). We explain how this operator works and how we compute it in detail later in this section. For the moment, assume that we have already applied it for the edge set $L(K_i)$. Therefore we replace the polynomial $F(L(K_i))$ in the product with its collapsed version, denoted by $\rho_i(F(L(K_i)))$.

After multiplying the first polynomial and the collapsed version of the second polynomial, we get

$$F(E'_{i+1})\, \rho_i(F(L(K_i))) = \Big(\sum_{I \subseteq I'} \alpha_I\, x_I\, y_{I'-I}\Big) \Big(\sum_{S \subseteq K_i} \beta_S z_S\Big) \qquad (3)$$
Since this product includes edge polynomials from the edge set $E'_{i+1}$, we further reduce its size by applying the collapsing operator $\rho_{i+1}()$ on it and thus obtain $\rho_{i+1}\big(F(E'_{i+1})\, \rho_i(F(L(K_i)))\big)$.

Next, we explain how the collapsing operator works. Given two nodes $u, v \in V$, let π be a path from u to v in the maximal deterministic network G = (V,E) of the given probabilistic network. Here, by path we mean the set of edges traversed to reach v from u. Let I be a subset of indices, $I \subseteq \{1, 2, \ldots, n\}$. We define two set indicator functions $\chi_{u,v}()$ and $\omega_{u,v}()$ for the node pair (u, v). The first one takes the value $\chi_{u,v}(I) = 1$ if there is a path π from u to v such that $\mathcal{I}(\pi) \subseteq I$ (i.e., the index of every edge on π is in I) and 0 otherwise. For instance, in Figure 2, $\chi_{1,8}(\{1,2,5,6,7,10,11\}) = 1$. This is because {e2,e5,e6,e7,e10} forms a path from 1 (source) to 8 (target) and its set of indices {2, 5, 6, 7, 10} is a subset of the input set {1, 2, 5, 6, 7, 10, 11}. Similarly, the second indicator function takes the value $\omega_{u,v}(I) = 1$ if there is a minimal u-v cut κ such that $\mathcal{I}(\kappa) \subseteq I$ and 0 otherwise. For example, $\omega_{1,8}(\{2,3,4,5\}) = 1$, because {e3,e5} forms a minimal cut between nodes 1 and 8 and its set of indices {3, 5} is a subset of the input set {2, 3, 4, 5}.

Next, we extend the definitions of the set indicator functions χ and ω to multiple source nodes. The extended function $\chi_{S,v}(I)$ evaluates to 1 if there is a path π from at least one node u in S to v such that $\mathcal{I}(\pi) \subseteq I$ and 0 otherwise. Similarly, $\omega_{S,v}(I)$ evaluates to 1 if for all nodes $u \in S$ there is at least one minimal u-v cut κ such that $\mathcal{I}(\kappa) \subseteq I$ and 0 otherwise. Formally, we compute these functions as

$$\chi_{S,v}(I) = 1 - \prod_{u \in S} \big(1 - \chi_{u,v}(I)\big), \qquad \omega_{S,v}(I) = \prod_{u \in S} \omega_{u,v}(I) \qquad (4)$$
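Next, we formalize T-reachability of the node separator $K_{i+1}$. For this purpose, we define a new set indicator function $C_{S,T}()$ which evaluates to 1 only if $K_{i+1}$ is T-reachable from S. Otherwise, it evaluates to 0. With I denoting the indices of the edges that are present in a term and $I' - I$ the indices of the edges that are absent from it, we compute this function as

$$C_{S,T}(I) = \prod_{v \in T} \chi_{S,v}(I) \prod_{v \in K_{i+1} - T} \omega_{S,v}(I' - I) \qquad (5)$$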
We prove the correctness of Equations (4) and (5) in the Supplementary Materials.

Now we are ready to put all the pieces together and compute the collapsing operator $\rho_{i+1}$. Recall that each term of the given edge aggregation polynomial indicates a deterministic subnetwork topology for the edges in $E'_{i+1}$, combined with all deterministic topologies of the edges in $L(K_i)$ in which $K_i$ is S-reachable from $K_0$, for every $S \subseteq K_i$. If that combination ensures that $K_{i+1}$ is T-reachable from $K_0$, then the collapsing operator $\rho_{i+1}$ replaces all the variables of that term with $z_T$. More specifically, consider a term in Equation (3) after the product has been expanded, in the form $\gamma\, x_I\, y_{I'-I}\, z_S$, where $\gamma = \alpha_I \beta_S$. We compute the collapsing operator $\rho_{i+1}()$ on this term as

$$\rho_{i+1}(\gamma\, x_I\, y_{I'-I}\, z_S) = \gamma \Big[\sum_{T \subseteq K_{i+1}} C_{S,T}(I)\, z_T + \Big(1 - \sum_{T \subseteq K_{i+1}} C_{S,T}(I)\Big)\, x_I\, y_{I'-I}\, z_S\Big] \qquad (6)$$
The collapsing operator $\rho_{i+1}()$ [see Equation (6)] transforms each term of the polynomial into a single term. The resulting term either contains the variable $z_T$, where $T \subseteq K_{i+1}$, or remains unchanged. This is because $C_{S,T}(I)$ takes either the value 0 or 1. Thus, $\rho_{i+1}()$ leaves the term unchanged only if $C_{S,T}(I) = 0$ for all T. When $C_{S,T}(I) = 1$ for some $T \subseteq K_{i+1}$, the coefficient of $x_I\, y_{I'-I}\, z_S$ becomes 0, and the operator returns $\gamma z_T$. Furthermore, from Equation (5), we know that if $C_{S,T}(I) = 1$, then for all T′ ≠ T ($T' \subseteq K_{i+1}$), $C_{S,T'}(I) = 0$. Thus, the function $\rho_{i+1}()$ returns no other term containing a variable $z_{T'}$.

Now suppose that a term has collapsed to $z_T$ and a new edge $e_j$ is added in Step 2 of Algorithm 1. From a polynomial point of view, the $z_T$ variable will be multiplied with $x_j$ and $y_j$, respectively, resulting in two new terms. From the graph reachability point of view, we know that the edges added prior to $e_j$ already ensure T-reachability, so $e_j$ does not make any difference: both its presence and its absence lead to reachable graph configurations. In the polynomial, the coefficients of $x_j z_T$ and $y_j z_T$ have to be added together to obtain the reachability probability. To take advantage of this observation, we introduce a special multiplication rule for the z variables: both $x_j z_T$ and $y_j z_T$ are replaced with $z_T$, for all j, so that their coefficients are added together.

The collapsing operator is very powerful as it ensures that the size of the edge aggregation polynomial never exceeds $2^{|K_i| + |E_{i+1}|}$ terms in the worst case (i.e., when the indicator function $C_{S,T}()$ always returns 0 until the last edge in $L(K_{i+1})$ is aggregated). More importantly, it guarantees to reduce the polynomial size down to at most $2^{|K_{i+1}|}$ terms once the edges in $L(K_{i+1})$ are all aggregated. This is a significant improvement because, without the collapsing operator, the size of the edge aggregation polynomial is $2^{|L(K_{i+1})|}$ after considering $K_{i+1}$ and goes up to $2^{|E|}$ after including all the edges.

So what is the reachability probability? After all the edges in $E_{i+1}$ have been added, all the terms will collapse, and the polynomial will be $\sum_{T \subseteq K_{i+1}} \gamma_T z_T$.
When K+1 = {t} is reached, the polynomial will have only two terms: . The coefficient γ{} is equal to the probability that the target node is reachable from the source node. We prove the correctness of our method in the Supplementary Material.
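The collapsing idea can be illustrated on a toy network with a minimal Python sketch. It is not the authors' implementation: it treats the whole network as a single subnetwork (no node separators), it collapses a term to `z` only once the edges present in that term already connect the source to the target, and it leaves non-reaching terms uncollapsed, so its worst case is still exponential. The function names, the symbol `"z"` and the toy edge probabilities are hypothetical. A brute-force enumeration over all deterministic topologies is included as a reference value for the coefficient of `z`.

```python
from itertools import product

def reachable(present_edges, source, target):
    """Depth-first search over the directed edges present in one topology."""
    adjacency = {}
    for u, v in present_edges:
        adjacency.setdefault(u, []).append(v)
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def reach_prob_collapsing(edge_probs, source, target):
    """Aggregate edges one by one; collapse terms that already reach the target."""
    Z = "z"                      # hypothetical name for the collapsed variable
    poly = {frozenset(): 1.0}    # term (set of present edges, or Z) -> coefficient
    for edge, p in edge_probs.items():
        new_poly = {}
        for term, coeff in poly.items():
            if term == Z:
                # Multiplication rule: z times "edge present" and z times
                # "edge absent" both stay z, so the coefficient is unchanged.
                new_poly[Z] = new_poly.get(Z, 0.0) + coeff
                continue
            with_edge = term | {edge}
            key = Z if reachable(with_edge, source, target) else with_edge
            new_poly[key] = new_poly.get(key, 0.0) + coeff * p          # edge present
            new_poly[term] = new_poly.get(term, 0.0) + coeff * (1 - p)  # edge absent
        poly = new_poly
    return poly.get(Z, 0.0)

def reach_prob_bruteforce(edge_probs, source, target):
    """Reference value: sum the weights of all topologies in which target is reached."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        weight = 1.0
        present = []
        for edge, is_present in zip(edges, mask):
            weight *= edge_probs[edge] if is_present else 1 - edge_probs[edge]
            if is_present:
                present.append(edge)
        if reachable(present, source, target):
            total += weight
    return total

# Toy network (made-up probabilities): s -> a -> t plus a direct edge s -> t.
toy = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "t"): 0.5}
# Both values are approximately 0.86 = 1 - (1 - 0.5) * (1 - 0.9 * 0.8).
print(reach_prob_collapsing(toy, "s", "t"))
print(reach_prob_bruteforce(toy, "s", "t"))
```

In this sketch the coefficient accumulated on `z` plays the role of the reachability probability described above; the node separators and the per-subnetwork collapsing operators of the actual method are what keep the polynomial small on large networks.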
2.4 How to choose node separators
Depending on the topology of the maximal deterministic network, there can be many alternative sequences of node separators between the source and target nodes. Regardless of how we choose the node separators, our method is guaranteed to return the correct result. The choice of node separators, however, can affect the size of the intermediate polynomials, and thus the running time of our method, in two ways. (i) Ideally, each node separator K should contain a small number of nodes as it will produce variables of the form z. (ii) Each pair of consecutive node separators should have a small number of edges between them (i.e., E should be small). This is because, in the worst case, they yield terms. Finding an optimal sequence of node separators that minimizes the overall computation time is in itself an intriguing area worth investigating. The right balance between the separator size, the size of the edge sets between the separators and the amount of computation we are willing to spend on finding the solution is hard to find. Here, we use a greedy approach to find good node separators.
We consider the first node separator (K0) to be the source node itself. We determine the next node separator from the current one by considering all nodes that are one edge further from the current node separator. The set of nodes identified in this way is a minimal node separator, but it is not necessarily good, because it may contain nodes with incident backward edges (see Section 2.1). To make it good, we first identify the nodes that have incident backward edges and replace each of them with all the nodes that are reachable from them in one hop. Thus, we advance the node separator toward the target while keeping it minimal, and stop as soon as we encounter a good minimal node separator. This way, we aim to keep the size of E small. We repeat this process to select more good node separators until we reach the target.
3 RESULTS
In this section we experimentally evaluate our method. Section 3.1 presents the datasets and the experimental setup. Section 3.2 examines the running time of our algorithm. Section 3.3 presents the reachability profiles obtained with our method. Section 3.4 evaluates gene centrality based on the reachability profiles. Section 3.5 analyses the stability of the human TRN.
3.1 Datasets and implementation details
We evaluate our method using both synthetic and real biological networks.
Synthetic dataset. We generated the synthetic network dataset using the Barabasi–Albert random network model (Barabasi and Albert, 1999). We chose this model because it is the de facto standard for scale-free networks, which best describe most biological networks (Jeong ; Todor ; Yook ). We created six sets of random networks, each containing 10 networks with the same number of nodes: 50, 100, 150, 200, 250 and 300 nodes, respectively. The number of edges is twice the number of nodes in each network.
Real dataset. For experimentation on real biological networks, we used the human regulatory network of Gerstein . From this network, we selected only the reliable interactions by taking the intersection with those present in the DIP database (Xenarios ). The resulting network has 130 nodes and 172 edges. To assess the interaction confidence for each edge in this intersection, we used the logistic regression method of Sharan . This strategy is often used in the literature to compute interaction confidence (Bader ; Ourfali ; Sharan ; Shlomi ). We obtained the gene expression data of 575 leukemia patients from Zhang . We obtained control gene expression data in early progenitor cells from Laurenti . Both the control and the leukemia expression datasets are normalized using quantile normalization (Amaratunga and Cabrera, 2001). Each sample in the leukemia dataset belongs to one of eight different subtypes of leukemia (hyperdiploid, TCF3-PBX1, ETV6-RUNX1, MLL, Ph, Hypo, T-ALL and Other) or to one of the non-leukemia sample types CD10CD19 and CD34. We do not include samples from the last two categories in our experiments, since they contain only four samples each. We trained eight different logistic regression models, one for each leukemia subtype, to compute interaction probabilities for each subtype separately. Also, we classified the early progenitor cell samples into three categories: primitive (hematopoietic stem cells), lymphoid (ETP, MLP, ProB and B_NKpre) and myeloid (the rest of the samples). We trained a different logistic regression model for each category. Thus, we obtained different probability values for the edges of the human regulatory network, depending on the leukemia subtype or control group in which the gene expression levels were measured. This in turn results in different reachability probabilities. We identified all the source and all the target genes in our network using the hierarchical decomposition obtained by HIDEN (Gulsoy ). This resulted in 9 source genes and 88 target genes.
We used C++, Matlab and R for implementation. We ran our experiments on a 1.9 GHz AMD Opteron processor with 256 GB of memory.
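For readers who want to reproduce networks similar to the synthetic dataset, the following Python sketch generates a Barabasi–Albert network by preferential attachment. It is an illustration under assumptions, not the generator used for the experiments: the function name, the choice of m = 2 new edges per node, the orientation of each edge (from the new node to the older one) and the seed handling are all hypothetical, and the text does not specify how edge directions or edge probabilities were assigned. With m = 2 the sketch produces m(n − m) edges, which is close to, but not exactly, twice the number of nodes.

```python
import random

def barabasi_albert_edges(n_nodes, m, seed=None):
    """Grow a network by preferential attachment; returns a list of (new, old) edges."""
    rng = random.Random(seed)
    edges = []
    targets = list(range(m))   # the first new node attaches to the m seed nodes
    repeated = []              # each node appears once per incident edge (its degree)
    for new_node in range(m, n_nodes):
        for old in targets:
            edges.append((new_node, old))
        repeated.extend(targets)
        repeated.extend([new_node] * m)
        # Sample m distinct targets for the next node, proportionally to degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(repeated))
        targets = list(chosen)
    return edges

# One network per size used in the synthetic dataset (sizes taken from the text).
for n in (50, 100, 150, 200, 250, 300):
    network = barabasi_albert_edges(n, m=2, seed=n)
    print(n, len(network))
```

Because high-degree nodes appear more often in `repeated`, they are more likely to receive new edges, which is what produces the scale-free degree distribution.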
3.2 Evaluation of the running time
To evaluate the performance of our method systematically, we ran it on the synthetic networks of different sizes. We measured the running time for each synthetic network and each source–target pair. We took each node, in turn, as a source and then as a target. Thus, computing the reachability profile for the largest network size requires 300 × 299 = 89700 reachability probability computations per network. In total, we computed the reachability profiles of 10 × 6 = 60 networks, for a total of 2264500 reachability probabilities. In Figure 4, we report, for each set of networks, the average running time to compute the reachability probability for one source–target pair. The average is taken over all networks in the set and over all source–target pairs.
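The totals quoted above follow directly from the network sizes. The short Python check below recomputes them; the variable names are ours, and the only inputs are the sizes and the number of networks per set stated in the text.

```python
# Network sizes and number of networks per size, as stated in the text.
sizes = [50, 100, 150, 200, 250, 300]
networks_per_size = 10

# Every ordered pair of distinct nodes is one source-target pair: n * (n - 1).
pairs_per_network = {n: n * (n - 1) for n in sizes}

total_networks = networks_per_size * len(sizes)
total_pairs = networks_per_size * sum(pairs_per_network.values())

print(pairs_per_network[300])  # 89700
print(total_networks)          # 60
print(total_pairs)             # 2264500
```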
Fig. 4.
Average running time of our method on Barabasi–Albert networks for growing network sizes
The figure shows that the running time of our method on scale-free networks grows at most linearly with the number of nodes. Even for networks as large as 300 nodes and 600 edges, the average running time of our method per source–target pair remains in the range of milliseconds. This small running time allows us to compute entire reachability profiles in practical time for a large number of networks, which was not possible before.
For comparison, the inclusion–exclusion method (Ourfali ) and PReach (Gabr ) fail to complete execution on the same dataset: they exhaust the 256 GB of memory available in the system even for a single source–target node pair of the smallest network in our dataset.
For the real dataset investigated in this article, we computed 11 reachability profiles, one for each leukemia subtype or control group. For each of them, we computed 9 × 88 = 792 reachability probabilities (for 9 source and 88 target nodes), thus 8712 probabilities in total. Our method computed each of these probabilities in only 2.5 s on average. Both PReach and the inclusion–exclusion method fail to scale to this network size.
3.3 Reachability profiles in the human TRN
For each leukemia or control network, we computed the reachability probability for each pair of source–target nodes. We call the resulting set of probabilities the reachability profile of the network. In Figure 5, we show the reachability profiles for all leukemia subtypes and control groups as a heat map. Each row in the figure represents a leukemia subtype or a control group, and each column represents a source–target pair. The color intensity at a location represents the reachability probability for that pair. We applied hierarchical clustering on both dimensions based on the reachability profiles. Hierarchical clustering correctly groups the control group subtypes together, and likewise all the leukemia subtypes. This shows that the reachability profile can distinguish between healthy and leukemia cases.
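The row clustering of reachability profiles can be sketched in a few lines of Python. The text does not state which distance or linkage was used, so the Euclidean distance and average linkage below are assumptions, as are the function names. The four profiles are made-up toy values with neutral labels, not results from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative_merges(profiles):
    """Average-linkage agglomerative clustering; returns the list of merges in order."""
    clusters = {name: [name] for name in profiles}   # cluster label -> member rows
    merges = []
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                dists = [euclidean(profiles[x], profiles[y])
                         for x in clusters[a] for y in clusters[b]]
                avg = sum(dists) / len(dists)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a + "+" + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges

# Toy profiles: one row per network, one column per source-target pair.
toy_profiles = {
    "control_a": [0.10, 0.20, 0.15],
    "control_b": [0.12, 0.18, 0.14],
    "leukemia_a": [0.70, 0.80, 0.75],
    "leukemia_b": [0.68, 0.82, 0.77],
}
merge_order = agglomerative_merges(toy_profiles)
print(merge_order)
```

In this toy example the two control rows merge first and the two leukemia rows merge second, which is the kind of separation the heat map is meant to reveal.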
Fig. 5.
Reachability profiles in the human regulatory network. Each row represents a cancer type or a control group. Each column represents a source–target pair. The intensity of each cell represents the reachability probability for that source–target pair—lighter color means higher probability
Source and target genes that show a noticeable gap between their reachability probabilities in control versus leukemia cases include SPI1 and POU2F2 as sources and TOPBP1, TFDP1, TFDP2, HDAC1, CDK8, REL, RELA and NFKB2 as targets. While these sources and targets have a low reachability probability in the control groups, they exhibit a higher range in the leukemia subtypes.
Our findings resonate with earlier observations. Our method clusters the hyperdiploid and the ETV6-RUNX1 subtypes together, while in (Zhang ), Supplementary Figure S22, a significant number of genes exhibit similar expression levels in these subtypes. They are frequently studied together, as they are both related to a favorable prognosis in children (Liang ; Paulsson ). In contrast, the Hypo subtype, which is least similar to hyperdiploid and ETV6-RUNX1 in our results, is associated with poor outcome (Holmfeldt ).
To further appreciate the value added by the reachability profiles, we performed another experiment based solely on gene expression data, without taking the regulatory network into account. In this experiment, we clustered the gene expression samples using k-means clustering. We set k = 11, as there are 11 subtypes in total in our dataset. Then, within each cluster, we examined the distribution of each leukemia type. The results are shown in Figure 6. They demonstrate that, with the exception of cluster 10, which consists primarily of T-ALL samples, all the clusters are a heterogeneous mix and do not have a clearly dominant leukemia type. Although one cluster consists only of control samples, the control subtypes are mixed together. Furthermore, the myeloid subtype samples are spread out through the rest of the clusters. We conclude that clustering based on gene expression alone is insufficient for classifying leukemia types.
Fig. 6.
Distribution of leukemia subtype and control group samples within clusters obtained from transcription data alone. Each cluster is normalized by the number of samples it contains
In light of these experimental observations, reachability profiles prove to be a reliable and valuable tool for assessing the viability of TRNs working as a whole.
3.4 Gene centrality using reachability profiles
We further illustrate the usefulness of reachability profiles by analysing the centrality of genes based on their contribution to the reachability profile (Gabr and Kahveci, 2013). For each gene, we compute its centrality by comparing the reachability profile of the original network with the reachability profile obtained when that gene is removed from the network. For a given gene g, whose centrality is under consideration, and a given source–target pair, the difference in reachability probability can be seen as the probability that g is indispensable for connecting the source to the target; in other words, that {g} is a node separator. The sum of this value over all source–target pairs is then the expected number of source–target pairs for which g is indispensable. To formalize this description, let us denote the sets of source and target genes with S and T, respectively, and the probability that gene t is reachable from gene s in the original network with p(s,t). The centrality of gene g is defined as the sum, over all s ∈ S and t ∈ T, of the difference between p(s,t) and the same probability computed for the network from which gene g is removed.
Figure 7 plots the centrality values for each leukemia type and each gene. We excluded from the plot the genes having centrality smaller than 1. As expected, only a few genes have a high centrality, which is a characteristic of scale-free networks. We also performed hierarchical clustering of the leukemia subtypes and of the genes based on their centrality. We observe that the most similar leukemia subtypes are T-ALL and Ph. The Ph subtype is a chromosomal abnormality resulting from the same translocation found in ALL (Talpaz ). The subtype least similar to these two is Hypo, as in the reachability profile experiment. TP53 and RB1 are two of the most central genes identified by our method. They are both characterized by alterations in Hypodiploid ALL (Holmfeldt ). We see that the most central gene is E2F1, which is a transcription factor known to have a crucial role in cell cycle and tumor suppression (Neuman ). Thus, malfunctioning of this gene severely affects many pathways in the regulatory network. Likewise, the next two genes, MYC and TBP, are known hubs regulating important functions. MYC is involved in cell proliferation and its persistent expression is common to many cancers (Nesbit ), while TBP is related to RNA polymerase II, an essential element of DNA transcription initiation (Kornberg, 2007). Among the top genes we identified based on their centrality is also EP300, a histone-modifying gene which was reported to inactivate lesions disrupting hematopoietic development in ETP ALL (Zhang ).
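The centrality definition can be made concrete with a small Python sketch. It is illustrative only: reachability probabilities are computed here by brute-force enumeration of all topologies (a stand-in that only works on tiny networks, not the method of this article), the function names are hypothetical, the toy edge probabilities are made up, and the gene under consideration is assumed to be neither a source nor a target.

```python
from itertools import product

def reach_prob(edge_probs, source, target):
    """Exact reachability probability by enumerating every deterministic topology."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        weight = 1.0
        adjacency = {}
        for (u, v), is_present in zip(edges, mask):
            p = edge_probs[(u, v)]
            weight *= p if is_present else 1 - p
            if is_present:
                adjacency.setdefault(u, []).append(v)
        # Depth-first search from the source in this topology.
        stack, seen = [source], {source}
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if target in seen:
            total += weight
    return total

def remove_gene(edge_probs, gene):
    """Drop every edge incident to the removed gene."""
    return {(u, v): p for (u, v), p in edge_probs.items() if gene not in (u, v)}

def centrality(edge_probs, sources, targets, gene):
    """Sum over source-target pairs of the drop in reachability when gene is removed."""
    reduced = remove_gene(edge_probs, gene)
    return sum(reach_prob(edge_probs, s, t) - reach_prob(reduced, s, t)
               for s in sources for t in targets)

# Toy network: s -> a -> t plus a direct edge s -> t.
toy = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "t"): 0.5}
# Removing "a" lowers the reachability of t from about 0.86 to 0.5.
print(centrality(toy, ["s"], ["t"], "a"))
```

Each summand is the probability that every available path from the source to the target passes through the removed gene, so the sum counts, in expectation, the pairs that depend on it.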
Fig. 7.
Centrality of genes in the human regulatory network for different leukemia subtypes. Light color denotes high centrality. We only show the genes with centrality value >1 for at least one network
3.5 Assessment of network stability
Besides characterizing single genes using centrality, we also performed an experiment to characterize the entire human TRN. In this experiment, we assess the level of stability of each of the studied networks. We measure the stability of a network as the average change in reachability probability when edge probabilities are randomly perturbed.
Consider the given probabilistic network G and the sets of source and target nodes S and T. Also consider a parameter δ that denotes the maximum change in edge probabilities. We defined a perturbed edge probability function Pδ, with values in [0,1], that, for each edge with original probability p, returns a value drawn uniformly at random from the range [p − δ, p + δ] ∩ [0,1]. Using these perturbed probabilities, we constructed a perturbed network Gδ. For every pair of s ∈ S and t ∈ T, we measured the reachability probability in G as p(s,t), as well as that in Gδ as pδ(s,t). We then computed the absolute difference |p(s,t) − pδ(s,t)|. We repeated this experiment 20 times. We computed the average of the resulting values over all s ∈ S and t ∈ T, as well as over the 20 experiments.
We plotted the results for different values of δ for the leukemia networks as well as for the control networks. Figure 8 shows the results. The first observation we draw from the figure is that the change in reachability probability is linear in δ for all networks. We also observe that even when the edge probabilities are perturbed in the range of ±0.3, the change in reachability probability does not exceed 0.1. From these two observations we can judge transcription-factor regulation in Homo sapiens as highly stable and insensitive to random perturbation. This conclusion holds for both healthy people and leukemia patients. However, we also observe that the gap between the networks is not constant; it increases slightly with the perturbation level. At the extreme case of ±0.3, the gap is at its maximum. There, T-ALL shows the most sensitivity to this level of perturbation, while Hypo and MLL show the least.
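The perturbation experiment can be sketched in Python as follows. This is a toy illustration under assumptions: reachability probabilities are computed by brute-force enumeration, which is only feasible for very small networks and is not the method of this article; the function names, the seed and the toy edge probabilities are hypothetical. The perturbation window [p − δ, p + δ] ∩ [0,1] and the 20 repetitions follow the description above.

```python
import random
from itertools import product

def reach_prob(edge_probs, source, target):
    """Exact reachability probability by enumerating every deterministic topology."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        weight = 1.0
        adjacency = {}
        for (u, v), is_present in zip(edges, mask):
            p = edge_probs[(u, v)]
            weight *= p if is_present else 1 - p
            if is_present:
                adjacency.setdefault(u, []).append(v)
        stack, seen = [source], {source}
        while stack:
            node = stack.pop()
            for nxt in adjacency.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if target in seen:
            total += weight
    return total

def perturb(edge_probs, delta, rng):
    """Redraw each probability uniformly from [p - delta, p + delta] clipped to [0, 1]."""
    return {edge: rng.uniform(max(0.0, p - delta), min(1.0, p + delta))
            for edge, p in edge_probs.items()}

def average_change(edge_probs, sources, targets, delta, repeats=20, seed=0):
    """Mean absolute change in reachability over all pairs and all repetitions."""
    rng = random.Random(seed)
    base = {(s, t): reach_prob(edge_probs, s, t) for s in sources for t in targets}
    total = 0.0
    for _ in range(repeats):
        noisy = perturb(edge_probs, delta, rng)
        for (s, t), p in base.items():
            total += abs(p - reach_prob(noisy, s, t))
    return total / (repeats * len(base))

# Toy network with made-up probabilities; sweep the perturbation level.
toy = {("s", "a"): 0.9, ("a", "t"): 0.8, ("s", "t"): 0.5}
for delta in (0.0, 0.1, 0.2, 0.3):
    print(delta, average_change(toy, ["s"], ["t"], delta))
```

With δ = 0 the perturbed network equals the original one and the average change is zero; larger values of δ widen the sampling window around each edge probability.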
Fig. 8.
Effect of random perturbation as a measure of network stability: average change in reachability probability (ΔPreach) when each interaction probability p is altered to a random value in the window p ± δ∩[0,1]
4 CONCLUSION
In this article we have characterized different types of leukemias based on the state of the regulatory networks in patients affected by this disease. The state is evaluated through reachability profiles. The reachability profile describes the ability of regulator genes to affect the transcription factors. For this purpose, we developed a fast, exact method for computing the probability that a signal reaches a destination node from a source node in a probabilistic network. The rigorous mathematical apparatus, which involves polynomials and polynomial collapsing operators, enables fast execution, as demonstrated in the performance evaluation experiments. Valuable uses of the reachability profiles illustrated in this article include leukemia subtype classification, gene centrality evaluation and regulatory network stability analysis. All of these are valuable tools for evaluating the viability of the TRN as a whole under varying conditions, rather than being limited to individual gene expression levels. An interesting parallel can be drawn between our solution and Bayesian Network inference. However, as we mentioned in Section 1, this alternative is limited to acyclic networks. We see a possible application of the Bayesian Network alternative in combination with the reduction of strongly connected components to single nodes, but this solution deserves a careful examination by itself.
Funding: NSF CCF-1251599, NSF DBI-1262451, NSF IIS-084-5439.
Conflict of Interest: none declared.