Identification of important nodes in directed biological networks: a network motif approach.

Abstract

Identification of important nodes in complex networks has attracted increasing attention over the last decade. Various measures have been proposed to characterize the importance of nodes in complex networks, such as the degree, betweenness and PageRank. Different measures consider different aspects of complex networks. Although numerous results have been reported on undirected complex networks, few have been reported on directed biological networks. Based on network motifs and principal component analysis (PCA), this paper introduces a new measure to characterize node importance in directed biological networks. Investigations on five real-world biological networks indicate that the proposed method can robustly identify truly important nodes in different networks, such as command interneurons, global regulators, and nodes that are not hubs but are evolutionarily conserved. Receiver Operating Characteristic (ROC) curves for the five networks indicate remarkable prediction accuracy of the proposed measure. The proposed index provides an alternative complex network metric. Potential implications of the related investigations include identifying network control and regulation targets, biological network modeling and analysis, as well as network medicine.

Year: 2014 PMID: 25170616 PMCID: PMC4149525 DOI: 10.1371/journal.pone.0106132

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Complex network theory and its applications have been popular topics in recent years [1]–[8]. Many real-world systems, such as social and biological systems, can be described as complex networks and investigated through complex network theory. Gene regulatory networks, signal transduction networks, neural networks, protein-protein interaction networks and metabolic networks are typical biological networks, which have been extensively investigated in related areas [9]–[15]. Complex networks consist of nodes and edges. An edge denotes the interaction between two nodes and can be directed or undirected. Many biological networks are directed. For example, in gene regulatory networks, nodes represent genes or transcription factors, and edges represent the interactions between transcription factors and the regulated genes, or between transcription factors. Over the last decade, identification of important nodes in complex networks has been an intriguing topic [16]–[32]. For example, in social networks, if one knows which nodes are the most important, one can control these nodes first to prevent the spread of infectious diseases [16]. However, it is still a challenge to determine which nodes are important in a complex network. Traditionally, the degree is frequently used to characterize the importance of a node [1]–[3], [6]–[8], [16]. Other indexes include the betweenness [19], closeness [1], k-shell [7], principal component centrality [17] based on the adjacency matrix of the network, semi-local centrality [20], motif centrality [25]–[30], PageRank [21] and others. For undirected networks, some researchers believe that the most connected nodes are the most influential [1]–[3]. More recently, however, Kitsak et al. [31] investigated the spreading dynamics on four real-world complex networks. They found that for networks with a single initial spreader, the k-shell can predict the outcome of spreading dynamics more reliably than the degree and betweenness.
Subsequently, Chen et al. [20] proposed a semi-local centrality, which considers the degrees of both the nearest and next-nearest neighbors of a node. The semi-local centrality characterizes influential spreaders in complex networks more effectively than the degree and betweenness. Recently, following the method in [31], we identified influential spreaders in artificial random, small-world and scale-free networks, and obtained some general conclusions [32]. However, although numerous results have been reported on undirected networks, few have been reported on directed biological networks [25]–[30]. In 2004, Sporns et al. [27] proposed the concept of a motif fingerprint in brain networks, which uses the number of appearances of each node in network motifs of a given size as a measure. In 2007, based on the motif fingerprints and some other centrality measures, Sporns et al. [28] investigated the identification and classification of hubs in some brain networks. Also in 2007, based on the concept of network motifs, Koschützki et al. [25], [26] proposed some new motif-based measures for gene regulatory networks. They took the number of occurrences of each node in the 3-node feed-forward loop (FFL) as a measure; by further considering the direction of each edge, they proposed two extended measures. Interesting results on finding the global regulators in the gene regulatory network of E. coli were reported. In this paper, based on the occurrences of each node in all 2-node, 3-node and some 4-node network motifs and on the PCA, we aim to develop a new method to characterize node importance in directed biological networks. To evaluate the performance of the new index, we compare it with the in-degree, out-degree, total degree, PageRank, motif centrality and betweenness. Investigations on five real-world biological networks illustrate the performance of the proposed measure.

Materials and Methods

Network Motifs and Motif Detection

In 2002, Alon et al. proposed the concept of the network motif, defined as a subgraph that appears in a network significantly more often than in randomized networks [9], [11]–[13]. Network motifs are building blocks of complex biological networks [11]. The functions of some motifs have been extensively investigated. For example, the functional and structural advantages of the FFLs have been established both theoretically and experimentally [33]–[37]. Two-node motifs include the double negative feedback loop, the double positive feedback loop, and those with auto-activation or auto-repression loops [38], [39]. Three-node motifs include the FFLs, the repressilator and so on [39], [40], some of which are shown in Fig. 1. Fig. 1(a) shows the Drosophila developmental transcriptional network with 119 nodes and 306 directed edges [13]. Fig. 1(b) shows some representative 2, 3 and 4-node motifs. The motifs in Fig. 1(b) are coded as M with a subscript i followed by j, where the subscript i denotes the size of the motif and j is the motif ID number, a decimal number obtained from the adjacency matrix of the motif (for example, M₃38 is the 3-node motif with ID 38; for details, see the Mfinder tool guide [41]).
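As a rough illustration of how the membership of nodes in a three-node motif can be tallied, the following Python sketch counts, for each node, the FFLs it takes part in. The function name `ffl_occurrences`, the edge-list input and the brute-force enumeration are assumptions made for illustration only and do not describe how Mfinder or mDraw are implemented. An FFL is taken here to be an induced subgraph with edges x→y, y→z and x→z and no reverse edges among the three nodes.

```python
from itertools import permutations


def ffl_occurrences(edges):
    """Count, for every node, the feed-forward loops (FFLs) containing it.

    edges: iterable of (source, target) pairs of a directed network.
    Assumption: an FFL is an induced triple x->y, y->z, x->z with no
    reverse edges; brute force over ordered triples, for small networks only.
    """
    edge_set = set(edges)
    nodes = sorted({v for e in edge_set for v in e})
    counts = {v: 0 for v in nodes}
    for x, y, z in permutations(nodes, 3):
        required = {(x, y), (y, z), (x, z)}
        forbidden = {(y, x), (z, y), (z, x)}
        # The roles (x, y, z) are unique in an induced FFL, so each
        # FFL is found exactly once among the ordered triples.
        if required <= edge_set and not (forbidden & edge_set):
            for v in (x, y, z):
                counts[v] += 1
    return counts


# Hypothetical toy network, not taken from the paper.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("a", "d")]
toy_counts = ffl_occurrences(toy_edges)
```

In the toy network, nodes a and c each belong to two FFLs (a, b, c and a, c, d), while b and d each belong to one. Per-node counts of this kind, collected over several motif types, form the columns of the occurrence matrix used by the proposed measure.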

Figure 1

A real-world biological network and some network motifs.

(a) A Drosophila developmental transcriptional network with 119 nodes and 306 directed edges. (b) Some representative 2, 3 and 4-node motifs.

To detect network motifs, Milo et al. [11] scanned all possible i-node subgraphs in a network and its randomized counterparts, and defined network motifs as subgraphs that occur in the real network more often than in the randomized ones. Since 2002, many motif detection algorithms and software tools [41]–[46] have been developed. Examples include gSpan [43], Mfinder [41], FANMOD [45], Mavisto [46] and mDraw (http://www.weizmann.ac.il/mcb/UriAlon). In the following, we use mDraw to detect network motifs. For each network, we generate 100 randomized networks. The number of occurrences of a subgraph in the real-world network is denoted as $N_{real}$. The average number in the 100 random networks is denoted as $N_{rand}$, with standard deviation $S_d$. The Z score measures the significance of the subgraph [11] and is defined as $Z_{score} = (N_{real} - N_{rand})/S_d$. Another index, U, is defined as the number of times a subgraph appears in the investigated network with distinct sets of nodes. In this paper, subgraphs with $Z_{score} \geq 2$, $U \geq 4$ and $N_{real} - N_{rand} \geq 0.1 N_{rand}$ are identified as motifs.
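The three motif criteria can be checked mechanically once the subgraph counts are available. The sketch below is a minimal Python illustration; the function names, the argument layout and the use of the population standard deviation are assumptions, since the text does not state which form of standard deviation mDraw reports.

```python
from statistics import mean, pstdev


def z_score(n_real, rand_counts):
    """Z score = (N_real - N_rand) / S_d over the randomized networks."""
    n_rand = mean(rand_counts)
    s_d = pstdev(rand_counts)  # assumption: population standard deviation
    diff = n_real - n_rand
    if s_d == 0:
        # Degenerate case: all randomized counts are identical.
        if diff > 0:
            return float("inf")
        return float("-inf") if diff < 0 else 0.0
    return diff / s_d


def is_motif(n_real, rand_counts, uniqueness,
             z_min=2.0, u_min=4, excess=0.1):
    """Apply the thresholds Z >= 2, U >= 4 and N_real - N_rand >= 0.1 N_rand."""
    n_rand = mean(rand_counts)
    return (z_score(n_real, rand_counts) >= z_min
            and uniqueness >= u_min
            and n_real - n_rand >= excess * n_rand)
```

For instance, with hypothetical randomized counts `[10, 12, 8, 10]`, a subgraph seen 20 times with U = 5 passes all three tests, whereas one seen 11 times fails the Z score criterion.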

A new measure of node importance based on network motifs

Based on network motifs, we develop a new measure to characterize node importance in directed biological networks. Biological networks are composed of motifs, which act as functional units of the networks. For example, FFLs play functional roles in gene regulatory networks; an incoherent FFL can act as a fold-change detector [9], [35]. Some other 3-node motifs and the 4-node bi-fan motif M₄204 are also found to play functional roles in biological systems [9], [14]. Therefore, nodes that are frequently involved in network motifs may be more important. If a node is involved in several different types of network motifs, then it may have multi-functional roles. With this idea in mind, some related measures have been proposed to investigate biological networks [25]–[30]. We note that in some works, network motifs are treated as subgraphs, such as the works of Rubinov et al. [30] and Wuchty et al. [42]. Hereinafter, differently from the works in [25]–[30], we propose a new integrative measure based on all 2-node and 3-node motifs and some 4-node motifs in directed networks.

Specifically, suppose we have a directed network with n nodes, and that there are m types of 2, 3 and 4-node motifs in total. We denote the number of occurrences of node i in the j-th type of motif as $u_{ij}$, i = 1,…, n, j = 1,…, m. Then, we derive a matrix $A = (u_{ij})_{n \times m}$ for the network. In real-world networks, different types of motifs differ in importance. Therefore, we endow each motif with a weight $w_j$, j = 1, 2,…, m, where $w_j = c_j / \sum_{k=1}^{m} c_k$ and $c_k$ (k = 1, 2,…, m) denotes the number of the k-th type of motif. Subsequently, we derive a revised matrix $B = (b_{ij})_{n \times m} = (w_j u_{ij})_{n \times m}$. Based on B and the idea of the PCA [47]–[49], we construct the following index to obtain the node importance score:

$$I^{score} = \alpha_1 b_1 + \alpha_2 b_2 + \cdots + \alpha_m b_m, \qquad (1)$$

where $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_m)^T$ are parameters to be determined. The best index vector $I^{score}$ should distinguish well among different nodes. Therefore, the variance of $I^{score}$ should be as large as possible.

We take $b_1, b_2, \ldots, b_m$ as random variables, which represent the weighted counts of a node in the m types of motifs. For a certain network of size n, the n×m matrix $B = (b_1, b_2, \ldots, b_m)$ is an observation matrix of the m-dimensional random vector $\mathbf{b} = (b_1, b_2, \ldots, b_m)^T$. The covariance matrix $\Sigma$ of $\mathbf{b}$ can be estimated from its observation matrix B. Denote the covariance matrix of B as $\hat{\Sigma}$; then

$$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (B_i - \bar{B})^T (B_i - \bar{B}),$$

where $B_i$ is the i-th row of B, $\bar{B}$ is the column mean vector of B, and n is the network size. Note that $\hat{\Sigma}$ is just the unbiased estimator of $\Sigma$ [48]. Based on the above notation, we have a stochastic form of $I^{score}$ as $I = \alpha^T \mathbf{b}$. The variance of $I$ can be estimated by $\alpha^T \hat{\Sigma} \alpha$. To determine the unique optimal vector $\alpha$, we restrict $\alpha^T \alpha = 1$. Thus, $\alpha$ can be determined through the following constrained extremal problem:

$$\max_{\alpha} \ \alpha^T \hat{\Sigma} \alpha \quad \text{s.t.} \quad \alpha^T \alpha = 1. \qquad (2)$$

To solve the optimization problem (2) by the Lagrangian multiplier method, we construct the Lagrangian function

$$L(\alpha, \lambda) = \alpha^T \hat{\Sigma} \alpha - \lambda (\alpha^T \alpha - 1), \qquad (3)$$

and let

$$\frac{\partial L}{\partial \alpha} = 2 (\hat{\Sigma} - \lambda E) \alpha = 0, \qquad (4)$$

where E is the identity matrix. It follows from Eq.(4) that $\lambda$ and $\alpha$ are just an eigenvalue and an eigenvector of the matrix $\hat{\Sigma}$. Under Eq.(4), $\alpha^T \hat{\Sigma} \alpha = \lambda \alpha^T \alpha = \lambda$. Therefore, the optimal $\lambda$ and $\alpha$ are just the biggest eigenvalue and the corresponding unit eigenvector of $\hat{\Sigma}$. Denote the eigenvalues of $\hat{\Sigma}$ as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m \geq 0$; then the optimal $\lambda = \lambda_1$. From the theory of the PCA, the ratio $\lambda_1 / \sum_{i=1}^{m} \lambda_i$ reflects the contribution of $I$, that is, how much of the information in $\mathbf{b}$ can be extracted by $I$.

So far we have determined $\alpha$. For a concrete network, replacing $b_1, \ldots, b_m$ in Eq.(1) with concrete values, one determines the observation of $I$ as $I^{score}$. Finally, the nodes in the network can be ranked according to $I^{score}$. Nodes with larger $I^{score}$ values are more structurally important. Based on $I^{score}$ and a well-defined distance, such as the Euclidean distance, the n nodes can be classified into several clusters, where nodes in the same cluster are similarly important. To sum up, for a network with n nodes, the procedure of the proposed measure is as follows.
1) Detect 2, 3 and 4-node network motifs in the network.
2) Count the occurrences of each node in the m types of motifs, and derive an n×m matrix A.
3) Perform data processing on A, such as weighting and standardizing, to obtain a matrix B. Compute the covariance matrix $\hat{\Sigma}$ of B.
4) For $\hat{\Sigma}$, compute the biggest eigenvalue $\lambda_1$ and the corresponding unit eigenvector $\alpha$.
5) Compute $I^{score}$ according to (1) and rank the n nodes accordingly.
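Steps 2) to 5) of the procedure can be sketched in a few lines of Python with NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `importance_scores` is hypothetical, each motif weight is assumed to be the motif count divided by the total count over all motif types, no further standardization is applied, and the sign of the eigenvector (which PCA leaves arbitrary) is fixed so that its entries sum to a non-negative value.

```python
import numpy as np


def importance_scores(A, counts):
    """Motif-based PCA score for each node.

    A:      n x m array, A[i, j] = occurrences of node i in motif type j.
    counts: length-m sequence, number of motifs of each type in the network.
    Returns (scores, alpha, contribution).
    """
    A = np.asarray(A, dtype=float)
    c = np.asarray(counts, dtype=float)
    w = c / c.sum()                      # assumed weight: share of each motif type
    B = A * w                            # weighted occurrence matrix
    S = np.atleast_2d(np.cov(B, rowvar=False))  # unbiased covariance (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigenvalues in ascending order
    alpha = eigvecs[:, -1]               # unit eigenvector of the largest eigenvalue
    if alpha.sum() < 0:                  # assumption: orient so larger means more important
        alpha = -alpha
    scores = B @ alpha                   # one importance score per node
    contribution = eigvals[-1] / eigvals.sum()
    return scores, alpha, contribution


# Hypothetical occurrence matrix (6 nodes, 3 motif types), not from the paper.
A_toy = [[4, 1, 1], [1, 0, 1], [5, 2, 0], [2, 0, 0], [4, 1, 1], [2, 0, 0]]
scores, alpha, contribution = importance_scores(A_toy, counts=[6, 3, 1])
ranking = np.argsort(-scores)            # node indices, most important first
```

Because `alpha` is a unit eigenvector of the covariance matrix, the sample variance of `scores` equals the largest eigenvalue, which is the property the maximization argument relies on.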

An illustrative example

To illustrate the procedure of the proposed method, we give a simple example. The artificial network contains 6 nodes, and its topology is shown in Fig. 2(a). Suppose there are three motifs in the network, namely, M₃38, M₃108 and M₂6, as shown in Fig. 2(b). Fig. 2(c) lists the members of the three motifs. Occurrences of nodes in each motif are summarized in Fig. 2(d). As we see, the numbers of occurrences of M₃38, M₃108 and M₂6 are 8, 2 and 2, respectively. Therefore, the weights of M₃38, M₃108 and M₂6 are 8/12 = 2/3, 2/12 = 1/6 and 2/12 = 1/6, respectively. Subsequently, we derive the matrix B and its covariance matrix $\hat{\Sigma}$.

Figure 2

An illustrative example.

(a) A simple network with six nodes. (b) Subgraphs that are assumed to be motifs in network (a). (c) Members that compose the three types of motifs. (d) Appearances of nodes in each motif shown in panel (b). (e) Frequency histograms for the six nodes. (f) Cluster analysis reveals that the six nodes can be clearly classified into three classes: $v_1$, $v_3$, $v_5$ are the most important nodes, $v_4$, $v_6$ form a second group that is more important than $v_2$, and $v_2$ forms the least important group.

The eigenvalues of the covariance matrix $\hat{\Sigma}$ are then computed, and the unit eigenvector corresponding to the largest eigenvalue $\lambda_1$ gives the coefficients of the index $I^{score}$ in Eq.(5). The contribution of $I^{score}$ is $\lambda_1 / \sum_i \lambda_i = 98.50\%$. That is, 98.50% of the information contained in $b_1$, $b_2$, $b_3$ can be extracted by $I^{score}$. Therefore, $I^{score}$ can optimally rank the 6 nodes. Substituting $b_1$, $b_2$, $b_3$ from matrix B into Eq.(5) yields the $I^{score}$ vector, whose third value is the biggest. Therefore, node $v_3$ is the most important one, followed by $v_1$, and the least important node is $v_2$. If one simply considered the total number of occurrences of a node in all the motifs, then $v_2$ and $v_4$ would be treated as equally important. With the proposed method, however, $v_4$ is more important than $v_2$, which is reasonable because M₃38 occurs significantly more frequently than the other motifs. Based on $I^{score}$ and through cluster analysis, the six nodes can be classified into three clusters, where $v_1$, $v_3$, $v_5$ are members of the most important cluster; $v_4$, $v_6$ are members of the less important cluster; and $v_2$ is the single member of the unimportant cluster.

Data descriptions

The five real-world biological networks include the C. elegans Neural (CEN) network [50], [51], the E. coli Transcriptional (ECT) regulatory network from the RegulonDB database [52], the Yeast Transcriptional (YT) regulatory network [53], the Drosophila Developmental Transcriptional (DDT) network [13], and the Human Signal Transduction (HST) network [13]. We note that the investigated networks are of high quality and have been frequently used as models to detect network motifs [9], [11]–[13]. Simple statistical indexes for the five networks are summarized in Tab. 1. The numbers of nodes in these networks range from 119 to 1706, and the numbers of edges range from 306 to 3870. The five networks are rich in network motifs, such as the FFL M₃38, M₃46 and the bi-fan M₄204. Note that we have considered all 2-node and 3-node motifs but, for simplicity, only three 4-node motifs: M₄204, M₄328 and M₄904. There are 199 connected 4-node subgraphs in total, and there are many 4-node motifs in the five networks. For example, the CEN and ECT each contain seven 4-node motifs. We consider only these three 4-node motifs because the bi-fan M₄204 and the bi-parallel M₄904 have been frequently investigated in various contexts [9] and are common motifs in many different real-world networks [11], and because the 4-node chain M₄328 may play crucial roles in signal transduction pathways. From Tab. 1, the CEN has the richest set of motifs. Subgraph M₂6 is a motif only in the CEN and ECT, where its actual numbers are 233 and 10, respectively. M₄328 is a motif only in the HST, where its actual number is 1570. There are no 3-node motifs in the HST, whereas for most of the networks the FFL and bi-fan are motifs. The YT contains only the FFL and bi-fan motifs.

Table 1

Statistical indexes for the five directed biological networks.

Network	CEN	ECT	DDT	HST	YT
Node	280	1706	119	227	685
Edge	2194	3870	306	312	1052
Ave. in-degree	7.8357	2.2685	2.5714	1.3744	1.5358
Ave. out-degree	7.8357	2.2685	2.5714	1.3744	1.5358
Ave. total degree	15.6714	4.5369	5.1428	2.7489	3.0715
Ave. I^score	5.6753	35.9339	2.0367	12.3849	7.2407
M ₂6	233	10	-	-	-
M ₃38	1453	1196	174	-	62
M ₃46	552	226	26	-	-
M ₃108	385	-	16	-	-
M ₃110	175	-	-	-	-
M ₃238	48	-	-	-	-
M ₄204	2274	29535	-	280	1812
M ₄328	-	-	-	1570	-
M ₄904	2253	-	-	275	-

“-” denotes no such item.

Results

Identifying important nodes in the five networks

Following the same procedure as in the illustrative example, one can obtain the order factor for each network. Note that, because the occurrences of different motifs differ by orders of magnitude, we have applied standardized transformations to matrix B. Moreover, we denote the columns of matrix B as vectors b indexed by i and j, where i and j have the same meaning as in the motif code M. The $I^{score}$ for each of the five networks is obtained in this way. Replacing b with concrete values in matrix B for each network, one obtains the importance score for each node. Average $I^{score}$ values for the five networks are shown in Tab. 1. Based on $I^{score}$, we can characterize the node importance and classify the nodes of each network via cluster analysis. The basic idea of cluster analysis is as follows [48]. According to $I^{score}$, the Euclidean distance between any two nodes can be obtained. First, the two nodes with the shortest distance are merged into one group, and each of the remaining nodes forms a group of its own. Then, one merges node groups via the single linkage method until all nodes are finally merged into one cluster. This clustering process can be represented by a dendrogram. From cluster analysis, one can classify nodes into groups, with similarly important nodes in the same group. Furthermore, from the dendrogram, one can intuitively gain some knowledge about the structural features of the network. Fig. 3 shows the dendrograms for the top-30 nodes of the five networks. We can see that these nodes can be roughly classified into three or four groups. Detailed information on the top-30 nodes in the CEN, ECT and YT and their corresponding rankings by the other methods are summarized in Tabs. 2–4. The corresponding information for the DDT and HST is shown in Tables S1 and S2. In each table, we show the in-degree and out-degree of the nodes as well as their rankings by the other methods. Here, R_total is based on the total degree, R_p is based on the PageRank, R_mc is based on the motif centrality, and R_bet is based on the betweenness.
The motif centrality considers only the FFL; since there is no such motif in the HST, it fails to work in the HST. For each network, the last group contains the largest number of nodes, while the most important group $G_1$ contains only one to three nodes. From Fig. 3, for the five biological networks, only a few nodes are far more important than the others. There are clear hierarchical structures in these networks, which indicates that the proposed measure may also act as an effective hierarchical index.
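The single-linkage grouping of nodes by their scores can be sketched as follows. Because the scores are one-dimensional, single linkage into k clusters amounts to sorting the scores and cutting at the k−1 largest gaps between neighbouring values, which is what this hypothetical helper `single_linkage_1d` does. The choice of k is left to the user here, whereas in the text it is read off the dendrogram.

```python
def single_linkage_1d(scores, k):
    """Split nodes into k clusters by single linkage on 1-D scores.

    Returns a list of clusters (lists of node indices), ordered from the
    highest-scoring cluster to the lowest. On a line, single linkage only
    ever joins neighbouring sorted values, so cutting the k - 1 widest
    gaps gives the same partition as stopping the merging at k clusters.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    gaps = [(scores[order[i]] - scores[order[i + 1]], i)
            for i in range(len(order) - 1)]
    cut_after = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])
    clusters, start = [], 0
    for cut in cut_after:
        clusters.append(order[start:cut + 1])
        start = cut + 1
    clusters.append(order[start:])
    return clusters


# I score values of the ten top-ranked CEN nodes, as listed in Table 2.
cen_top10 = [54.90, 54.50, 53.81, 45.99, 41.11,
             35.86, 35.07, 24.48, 23.85, 23.55]
cen_groups = single_linkage_1d(cen_top10, k=3)
```

With k = 3, the first three nodes (AVER, AVBR, AVEL) form one cluster and the next four (AVDR, AVDL, AVBL, AVAL) another, which agrees with groups G₁ and G₂ in Table 2.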

Figure 3

Cluster analysis for the identified top-30 nodes in the five networks based on the I score.

Table 2

Clusters, members, rankings and statistical characteristics of the identified top-30 nodes in the CEN.

Group	Node	I^score	Out-deg.	R_out	In-deg.	R_in	R_total	R_p	R_mc	R_bet
G ₁	12:AVER	54.90	18	11	33	6	10	126	6	22
	58:AVBR	54.50	15	14	38	4	8	119	4	28
	25:AVEL	53.81	16	13	36	5	9	129	5	21
G ₂	149:AVDR	45.99	24	7	33	6	6	64	2	34
	131:AVDL	41.11	19	10	27	7	11	106	12	49
	56:AVBL	35.86	20	9	40	3	3	58	1	18
	71:AVAL	35.07	37	2	53	1	2	28	3	2
G ₃	94:AVJR	24.48	12	17	18	13	21	114	7	96
	28:AVAR	23.85	49	1	49	2	1	13	8	1
	113:AIBR	23.55	11	18	25	9	16	143	11	42
	107:DVA	22.30	35	3	19	12	7	24	9	7
	157:PVCL	21.90	32	4	27	7	4	32	10	5
	143:PVCR	21.11	32	4	26	8	5	16	13	3
	35:RICR	20.48	8	21	14	17	29	173	36	39
	179:ADAL	20.46	14	15	8	23	29	97	15	102
	31:RICL	18.35	12	17	11	20	28	130	36	56
	148:AVL	15.63	12	17	15	16	24	83	19	58
	41:ADEL	15.42	26	5	5	26	20	6	28	54
	204:PVNL	14.63	19	10	8	23	24	12	18	70
	45:RIAL	14.58	15	14	27	7	13	117	20	19
	24:AIBL	14.43	13	16	24	10	15	66	15	11
	223:AVG	14.42	17	12	5	26	29	4	14	69
	22:RIAR	14.22	18	11	26	8	12	108	17	10
	207:ASHR	13.88	13	16	8	23	30	60	30	119
	162:AVJL	13.12	14	15	21	11	17	25	16	31
	87:RMGR	12.84	14	15	8	23	29	23	40	53
	199:ADLR	12.38	15	14	2	29	34	77	23	146
	150:HSNR	11.89	25	6	16	15	14	8	22	8
	173:PVNR	11.55	22	8	11	20	19	15	23	36
	112:RIML	11.13	12	17	16	15	23	139	21	73

R_in and R_out represent the rankings by the in-degree and out-degree. R_total and R_p represent the rankings by the total degree and PageRank [21]. R_mc and R_bet denote the rankings by the motif centrality [25] and betweenness. The same notation is used in the following tables.

Table 4

Clusters, members, rankings and statistical characteristics of the identified top-30 ranked nodes in the YT.

Group	Node	I^score	Out-deg.	R_out	In-deg.	R_in	R_total	R_p	R_mc	R_bet
G ₁	553:STE12	489.54	71	1	0	11	1	1	10	19
	575:TEC1	482.02	44	3	0	11	3	7	10	19
G ₂	360:MSN2	363.05	35	6	0	11	6	14	10	19
	361:MSN4	356.90	32	7	0	11	7	16	10	19
G ₃	622:YAP1	138.11	38	4	0	11	4	6	10	19
	513:SKN7	121.70	21	13	0	11	12	19	10	19
G ₄	355:MIG1	58.87	26	9	0	11	9	4	7	19
	546:SSA4	49.23	0	32	4	7	28	100	10	19
	587:TKL2	49.23	0	32	4	7	28	100	10	19
	119:CTT1	45.81	0	32	6	5	26	100	10	19
	356:MIG2	45.79	12	20	0	11	20	31	9	19
	267:HSP78	42.39	0	32	4	7	28	100	10	19
	399:PGM2	42.39	0	32	6	5	26	100	10	19
	614:UME6	34.30	38	4	0	11	4	3	5	19
	264:HSP12	33.50	0	32	4	7	28	100	10	19
	152:DOG2	32.82	0	32	4	7	28	100	10	19
	523:SOD2	32.82	0	32	5	6	27	100	10	19
	592:TPS1	32.82	0	32	4	7	28	100	10	19
	651:YLR042C	32.82	0	32	4	7	28	100	10	19
	64:CAR2	30.08	0	32	6	5	26	100	10	19
	679:DAL81-DAL82	28.72	8	24	0	11	24	61	10	19
	263:HSP104	25.98	0	32	3	8	29	100	10	19
	265:HSP26	25.98	0	32	3	8	29	100	10	19
	364:MUC1	25.98	0	32	5	6	27	100	10	19
	545:SSA3	25.98	0	32	3	8	29	100	10	19
	100:CLN1	25.96	0	32	5	6	27	100	9	19
	59:BNI5	25.30	0	32	2	9	30	100	10	19
	143:DDR48	25.30	0	32	3	8	29	100	10	19
	145:DHH1	25.30	0	32	2	9	30	100	10	19
	208:GAT4	25.30	0	32	2	9	30	100	10	19

Functional characteristics of the top-ranked nodes

In the following, for the CEN, ECT and YT, we discuss whether the identified structurally top-ranked nodes are functionally important. For the CEN, the identified top-30 nodes are shown in Tab. 2. The top-7 nodes are AVER, AVBR, AVEL, AVDR, AVDL, AVBL and AVAL, which are all command interneurons. Additionally, AVAR, PVCL and PVCR are another three command interneurons, which are all top-ranked. The AVAs, AVBs, AVDs and PVCs are four bilaterally symmetric interneuron pairs with large-diameter axons that run the entire length of the ventral nerve cord and provide inputs to the ventral cord motor neurons. The AVAs are located in the lateral ganglia of the head of C. elegans and function as driver cells for backward locomotion [54]. The AVEs can drive backward movement of the animal along with the AVAs, AVDs and A-type motor neurons [54]. The AVDs function as touch modulators for backward locomotion induced by head touch. The PVCs are ventral cord interneurons; the absence of PVC neurons can cause a harsh touch defect [54]. From Tab. 2, AVER has the largest $I^{score}$ value, 54.90, while its in-degree and out-degree are 33 and 18, which are not the largest. Nevertheless, according to our investigation, AVER is the most important node in the CEN, which demonstrates that the $I^{score}$ differs from the degree measures. The PageRank fails to place most of the command interneurons even among the top 50. The betweenness ranks many of the command interneurons outside the top 20. The results for the CEN indicate that $I^{score}$ can help to identify the truly important nodes. For the ECT, the identified top-30 nodes are shown in Tab. 3. In 2003, Martínez-Antonio et al. [55] identified global regulators in an ECT network. There are 18 global regulators in the network, namely, CRP, IHF, FNR, fis, arcA, lrp, hns, narL, ompR, fur, phoB, cpxR, soxR, soxS, mlc, cspA, rob and purR.
Among these, CRP, FNR, IHF, fis, arcA, narL and lrp are seven key regulators, which can regulate the expression of 51% of the genes in E. coli [55]. According to $I^{score}$, eight of the top-12 nodes (CRP, FNR, arcA, IHF, fis, narL, lrp, fur) are global regulators. The in-degree ranks most of these eight global regulators near the bottom. The out-degree and total degree rank most of them within the top 10. According to the PageRank, motif centrality and betweenness, 2, 1 and 3 of the identified top-ranked global regulators, respectively, fall outside the top 10. The global regulator CRP, which represents the cAMP receptor protein, is the most important node. The CRP can regulate cAMP, and genes regulated by the CRP are mostly involved in energy metabolism [56]. The CRP has the largest out-degree, 496, but its in-degree is only 1. Although 280:csgE has the second largest in-degree, 12, it is not ranked among the top 30. From Tab. 3, the top-30 nodes can be classified into four clusters. The unimportant cluster contains the largest number of nodes. The members of the first three clusters are almost all global regulators. The observations from the ECT indicate that the proposed measure can help to find global regulators.

Table 3

Clusters, members, rankings and statistical characteristics of the identified top-30 nodes in the ECT.

Group	Node	I^score	Out-deg.	R_out	In-deg.	R_in	R_total	R_p	R_mc	R_bet
G ₁	325:CRP	6643.21	496	1	1	13	1	2	2	6
G ₂	98:FNR	4128.98	295	2	3	11	2	4	1	2
G ₃	844:arcA	3014.10	173	6	1	13	6	6	18	9
	1682:IHF	2924.94	219	4	0	14	4	3	3	55
	368:fis	2564.89	226	3	2	12	3	1	7	8
	910:narL	2267.63	121	8	2	12	8	18	5	18
	1678:H-NS	1675.38	186	5	0	14	5	7	4	55
	1691:narP	1438.18	49	17	0	14	17	39	41	55
G ₄	1542:cra	801.16	78	12	1	13	12	17	23	21
	1204:lrp	637.09	104	9	3	11	9	14	8	3
	1672:FlhDC	574.76	80	11	0	14	11	12	41	55
	384:fur	555.05	128	7	3	11	7	5	6	4
	1693:NsrR	538.26	83	10	0	14	10	11	26	55
	1688:ModE	386.47	46	18	0	14	18	20	25	55
	333:cysG	348.34	0	52	8	6	43	134	38	55
	534:nirB	348.34	0	52	8	6	43	134	38	55
	535:nirC	348.34	0	52	8	6	43	134	38	55
	536:nirD	348.34	0	52	8	6	43	134	38	55
	159:pflB	257.65	0	52	6	8	45	134	38	55
	558:pdhR	251.49	41	20	3	11	19	27	17	15
	13:gadX	234.85	27	29	13	1	22	30	12	5
	10:gadA	218.98	0	52	11	3	40	134	28	55
	922:nrfA	218.77	0	52	7	7	44	134	38	55
	923:nrfB	218.77	0	52	7	7	44	134	38	55
	924:nrfC	218.77	0	52	7	7	44	134	38	55
	925:nrfD	218.77	0	52	7	7	44	134	38	55
	926:nrfE	218.77	0	52	7	7	44	134	38	55
	927:nrfF	218.77	0	52	7	7	44	134	38	55
	928:nrfG	218.77	0	52	7	7	44	134	38	55
	136:lpd	207.41	0	52	7	7	44	134	36	55

For the YT, the top-30 nodes are shown in Tab. 4. STE12 and TEC1 are the two most important nodes, with $I^{score}$ values of 489.54 and 482.02, out-degrees of 71 and 44, and in-degrees of 0. STE12 and TEC1 are both transcription factors. It has been reported that STE12 controls two distinct developmental programs, mating and filamentation, and is therefore a key regulator of cell fate determination [57]. Although the TEC1 gene is considered to be involved in activating the expression of Ty1 and the adjacent genes, it is not essential in the control of the mating or sporulation processes [58]. It remains an intriguing question why TEC1 is so frequently involved in network motifs and acts as a building block of the YT network. According to the out-degree, total degree, PageRank, motif centrality and betweenness, most of the nodes in $G_4$ are equally important, which differs greatly from the results of $I^{score}$.

Performance evaluation based on ROC curves

To evaluate the performance of I score, we perform ROC analysis. ROC curve is frequently used to evaluate the performance of a new test in the field of signal processing and medical diagnostic tests [59]. For a network with n nodes, the procedures of ROC analysis are as follows. Suppose the nodes can be classified into two groups: important and unimportant, and we know the actual classification. For a new index, the n nodes are with values in the interval [a, b], for any threshold value , one can reclassify the n nodes into two classes. Comparing the actual classification with the new classification, several indexes can measure the accuracy of the new index, which are defined as follows [59]. where n 2 denotes the number of false positive nodes, which are considered important in the new classification but actually unimportant. n 4 gives the number of true negative nodes, where the nodes are both unimportant in the two classifications. Similarly, n 1 and n 3 denote the number of true positive and false negative nodes, respectively. P 1, P 2 are therefore called false and true positive rates, respectively. P 3 is called the accuracy of the new index. Given a T, one obtains a point (P 1, P 2). For , plotting the corresponding points in two dimensional coordinate system, we derive the ROC curve. The area under the curve (AUC) of ROC equals the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one [59], which can reflect the identification accuracy of the new index. The larger AUC, the more accurate of the index. Furthermore, the point in the upper left corner of a curve corresponds to the optimal threshold T, which gives the new classification of nodes with the highest P 3. Hereinafter, based on the available information of some of the investigated networks and ROC curves, we evaluate the performance of I score and the other indexes. 
In the following, for simplicity, we transform node ranks into fractional ranks in the range (0, 1]. For a node with rank k, its fractional rank is the ratio of the number of nodes with rank no greater than k to n. Nodes with smaller fractional ranks are therefore more important.

For the CEN, as mentioned above, the 10 command interneurons are known to be very important. Taking them as the important nodes, we derive the ROC curve for each index, as shown in Fig. 4(a). From Fig. 4(a), the in-degree, total-degree, I score and motif centrality can all identify the command interneurons well; the AUCs (trapezoidal method) for these indexes are 0.9991, 0.9985, 0.9974 and 0.9967, all above 0.99. The I score is slightly better than the motif centrality. The out-degree, PageRank and betweenness all perform worse than the other indexes. In addition, neurons in C. elegans can be classified into interneurons, motor neurons and sensory neurons, of which 117 function as interneurons. Taking the 117 interneurons as the important nodes, we obtain another ROC curve for each index, as shown in Fig. 4(b), where all seven measures have roughly similar performance. The I score is slightly better than the out-degree, in-degree, PageRank and betweenness.

For the ECT, there are 18 global regulators in total, 7 of which are key global regulators; these are actually important in the network. Taking the 7 key global regulators and the 18 global regulators in turn as the actually important nodes, we derive two ROC curves for each index, as shown in Figs. 4(c) and (d). In Fig. 4(c), the AUCs for the seven indexes are 0.9996, 0.4385, 0.9996, 0.9997, 0.9983, 0.9987 and 0.9239. Except for the in-degree and betweenness, all the indexes can identify the key global regulators well, and the I score is slightly better than the other indexes. From Fig. 4(d), the out-degree, total-degree, PageRank and motif centrality have quite large AUCs. The AUC for the I score is 0.8628, which is higher only than those for the in-degree and betweenness; however, when T = 0.0036, the I score classifies the nodes in the ECT with $P_3$ = 99.30%.
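The fractional rank defined above can be illustrated with a short sketch. The helper name `fractional_ranks` is hypothetical, and tied nodes are assumed to share the same integer rank.

```python
from bisect import bisect_right


def fractional_ranks(ranks):
    """Map integer ranks (1 = most important) to fractional ranks in (0, 1].

    The fractional rank of a node with rank k is the number of nodes whose
    rank is no greater than k, divided by the number of nodes n.
    """
    n = len(ranks)
    ordered = sorted(ranks)
    # bisect_right counts how many ranks are <= k in the sorted list.
    return [bisect_right(ordered, k) / n for k in ranks]
```

For example, four nodes with ranks 1, 2, 2 and 4 receive fractional ranks 0.25, 0.75, 0.75 and 1.0.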

Figure 4

ROC curves based on the available information in the CEN and ECT.

Performance of different indexes in identifying (a) the 10 command interneurons in the CEN, (b) the 117 interneurons in the CEN, (c) the 7 key global regulators in the ECT, (d) the 18 global regulators in the ECT.

For many biological networks, the actual classifications, known as gold standards, are not available. Fortunately, several methods have been proposed to evaluate a new test in this situation, such as constructing composite reference standards from multiple available tests [60], [61]. A single ranking from the in-degree, out-degree, total degree, PageRank, motif centrality or betweenness is imperfect and cannot act as a gold standard. Therefore, for each network, we construct a composite reference standard based on the six rankings (five in the HST) and evaluate the accuracy of the I score against it. Specifically, in the composite reference standard, a node is defined as important if it is among the top-$T_0$ nodes in at least one of the six rankings, where $T_0$ is a threshold that can be taken as 10%, 20% and so on. Thus, given a $T_0$, we derive a dichotomous reference classification of the nodes in the network, either positive (important) or negative (unimportant). Using the ranking from the I score, we take several threshold values T to reclassify the nodes and derive the ROC curve for each network, as shown in Fig. 5. Figs. 5(a) and (b) show the cases $T_0$ = 10% and $T_0$ = 20%, respectively. In Fig. 5(a), the AUCs for the five networks are 0.8977, 0.8237, 0.9406, 0.8499 and 0.7878, respectively. The points in the upper left corner of the ROC curves in Fig. 5(a) correspond to T = 20%, 5%, 10%, 10% and 5%, which lead to the highest $P_3$. For example, for the DDT, when the top-10% nodes are classified as important, the classification from the I score is most consistent with the reference classification, and $P_3$ reaches 94.96%. For $T_0$ = 20%, the AUCs for the five networks are 0.8740, 0.8884, 0.9521, 0.8955 and 0.7418, respectively. For both values of $T_0$ and for all networks, the AUCs are above 0.74. In particular, in the DDT, the AUC is above 0.94, which indicates high identification accuracy of the proposed measure.
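The construction of the composite reference standard can be sketched as follows. The function name `composite_reference` and the input layout (one list of fractional ranks per measure, all in the same node order) are assumptions made for illustration only.

```python
def composite_reference(fractional_rankings, t0):
    """Label a node important if any measure places it within the top-t0.

    fractional_rankings: dict mapping a measure name to a list of fractional
    ranks in (0, 1], one entry per node, in the same node order.
    t0: threshold as a fraction, e.g. 0.10 for the top 10%.
    """
    columns = list(fractional_rankings.values())
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all rankings must cover the same nodes")
    # A node is positive as soon as one ranking puts it at or above the cut.
    return [any(col[i] <= t0 for col in columns) for i in range(n)]
```

The resulting list of labels can serve as the reference classification against which the ranking from the I score is thresholded.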

Figure 5

Evaluation of I score via ROC curves with composite reference standards for the five networks.

(a) $T_0$ = 10%. A node is defined as important if its ranking by the in-degree, out-degree, total degree, PageRank, motif centrality or betweenness is within the top-$T_0$ level. (b) Same as (a), but with $T_0$ = 20%.

We note that, for the DDT, HST and YT, it is difficult to compare different measures via ROC curves, since the number of actually important nodes is still unknown. We also note that ROC analysis without a gold standard may be subject to bias in the composite reference standard. However, since the composite reference standards for the five networks are based on six or five existing measures, it is reasonable to treat them as reference standards. In conclusion, the ROC analysis indicates that the proposed measure is a useful alternative index for identifying structurally important nodes in directed networks.

Topological neighborhoods of several special nodes

From the ROC analysis of the CEN and ECT, some measures are better than the I score at identifying the command interneurons or global regulators. In this section, through an analysis of the topological neighborhoods of several nodes, we further illustrate the merits of the proposed measure. According to the I score, some hubs may not be important, whereas some non-hub nodes may be identified as very important. There are many highly connected nodes that are not highly ranked, such as 946: soxS in the ECT; 22: b-catenin and 68: fak in the HST; and 209: GCN4 and 332: MBP1-SWI6 in the YT. Examples of nodes with low degrees that are ranked in the top 20 include 333: cysG and 534-536: nirB-nirD in the ECT, and 546: SSA4 and 587: TKL2 in the YT.

In the following, we take nodes 209 and 546 in the YT as two representative examples. Node 209 has out-degree 53 and in-degree 0; it is the second most important node according to the out-degree and total degree, but is ranked 62nd by the I score. Node 546 has in-degree and total degree 4; it is ranked 28th by total degree, but is the eighth most important node according to the I score. Figs. 6(a) and (b) visualize the topological neighborhoods of the two nodes, including their nearest and second nearest neighbors. The neighborhood of node 209 involves 81 nodes connected by 111 directed edges centered at node 209, whereas the neighborhood of node 546 consists of 114 nodes and 182 directed edges. The connection density of the neighborhood of node 209 is much lower than that of node 546. Moreover, from Fig. 6(b), one can easily see that node 546 is directly regulated by four hub nodes and acts as a bridge or bottleneck of the topological neighborhood. More importantly, the four hub neighbors of node 546 are exactly the top-4 nodes. Although node 209 regulates 53 nodes, its neighbors are neither hubs nor important nodes. Furthermore, node 546 is involved in 1203 bi-fan subgraphs in its topological neighborhood, whereas node 209 is involved in only 39, which indicates that node 546 may play more functional roles in the system. Therefore, node 546 may be more important than node 209.

Finally, consider the biological functions of the two nodes. Node 209 represents GCN4. The GCN4 gene has been found to be conserved in S. cerevisiae, K. lactis and E. gossypii [62]. SSA4 is widely conserved in human, chimpanzee, Rhesus monkey, dog, cow, mouse, rat, chicken, zebrafish, fruit fly, C. elegans, S. cerevisiae and A. thaliana [62]. The cross-species conservation of a gene indicates that it has been maintained by evolution despite speciation. It is widely believed that a mutation in a highly conserved gene can lead to a non-viable life form, or to a form that is eliminated through natural selection [62], [63]. SSA4 is conserved across more species, which also suggests that SSA4 is more important than GCN4. In summary, there is sufficient evidence that the non-hub node 546 is actually more important than the hub node 209.
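To make the bi-fan comparison concrete, the sketch below counts the bi-fan subgraphs that contain a given node. It is a simplified illustration rather than the counting procedure used in the paper: the name `bifan_participation` is hypothetical, a bi-fan is taken to be two source nodes that each point to the same two target nodes, self-loops are ignored, and occurrences are counted without checking for additional edges among the four nodes (non-induced counting). It requires Python 3.8 or later for `math.comb`.

```python
from collections import defaultdict
from math import comb


def bifan_participation(edges, v):
    """Count bi-fans (a->c, a->d, b->c, b->d) that contain node v."""
    succ = defaultdict(set)   # node -> set of targets
    pred = defaultdict(set)   # node -> set of sources
    for a, b in edges:
        if a != b:            # assumption: self-loops are ignored
            succ[a].add(b)
            pred[b].add(a)

    count = 0
    # Case 1: v is one of the two sources; pair it with every other source b
    # and choose two of their shared targets.
    co_sources = {b for c in succ[v] for b in pred[c]} - {v}
    for b in co_sources:
        shared_targets = (succ[v] & succ[b]) - {v, b}
        count += comb(len(shared_targets), 2)
    # Case 2: v is one of the two targets; pair it with every other target d
    # and choose two of their shared sources.
    co_targets = {d for a in pred[v] for d in succ[a]} - {v}
    for d in co_targets:
        shared_sources = (pred[v] & pred[d]) - {v, d}
        count += comb(len(shared_sources), 2)
    return count
```

In a toy network where two regulators both target the same three genes, each regulator takes part in three bi-fans and each target in two.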

Figure 6

Topological neighborhoods of several nodes.

(a) Topological neighborhood of a hub that is not top-ranked: node 209 in the YT. (b) Topological neighborhood of a non-hub node that is top-ranked: node 546 in the YT. (c) Topological neighborhood of the node with the highest betweenness, which is not top-ranked: node 293 in the YT.

From the above analysis, node 546 appears to function similarly to nodes with high betweenness in undirected networks. However, there are great differences between directed and undirected networks. In the YT, node 209 has only outgoing edges (53) and node 546 has only incoming edges (4), so the betweenness [19] of both nodes is zero, which makes them the least important nodes according to that measure. Therefore, the I score differs from the classical betweenness. Furthermore, since the YT is a directed network, the betweenness of 96.06% (658/685) of its nodes is zero, so betweenness fails to act as an effective ranking measure. Node 293: IME1 has the largest betweenness in the YT, yet it is not highly ranked by the I score. Fig. 6(c) shows the topological neighborhood of node 293. Node 293 has 5 incoming and 13 outgoing edges, but it is not frequently involved in network motifs. In conclusion, the topological neighborhoods of these specific nodes further illustrate the merits of the proposed measure.

Discussion

Biological networks are typical real-world complex networks. It has been reported that a single measure is insufficient to clearly distinguish lethal nodes from viable ones in some biological networks [26], [64]. Therefore, it is of interest to find more effective measures to characterize node differences in biological networks. In this paper, by integrating the occurrences of each node in 2-node, 3-node and some 4-node network motifs, we have proposed a new measure to characterize node importance in biological networks. Based on ROC curves and an analysis of the topological neighborhoods of several specific nodes, we have compared the results with those from the degree, PageRank, motif centrality and betweenness. In the CEN and ECT, we compared the performance of the different measures with the command interneurons, interneurons, key global regulators and global regulators treated as the actually important nodes. The proposed measure performs well in both networks. The in-degree is good at identifying command interneurons in the CEN, but poor at finding global regulators in the ECT. The out-degree shows the opposite tendency. Although PageRank can effectively identify the global regulators in the ECT, it is the worst measure for identifying command interneurons or interneurons in the CEN. Similarly, the betweenness is not a good measure in either network. Therefore, the in-degree, out-degree, PageRank and betweenness are not robust indicators of important nodes across different networks. The I score provides an alternative robust measure for different types of biological networks. Since current knowledge of the five networks is limited, further exploring the advantages of the new measure remains an open problem.

Because the numbers of command interneurons in the CEN and global regulators in the ECT are much smaller than the network sizes, the ROC analysis may be affected by noise in both the interaction data and the computation. Other approaches may be used to further investigate the merits of the new measure, such as rich-club analysis [8], [65]–[69]. For simplicity, we examine only the connectivity densities among the same number of top-ranked nodes according to different measures in the ECT and HST, as shown in Fig. 7. Here, the connectivity density is defined as the ratio of the actual number of edges to the maximum possible number of edges among the top-ranked nodes. In Fig. 7, different curves correspond to different indexes. From Fig. 7, for many indexes, top-ranked nodes tend to have higher connectivity densities than nodes ranked at the tail. The motif centrality fails to work in the HST, since the FFL is not a motif in that network. Moreover, compared with the other indexes, the I score is very good at finding clusters with high connectivity density. That is, the connectivity density among a few motif-rich nodes is higher than that among the same number of top-ranked nodes according to the other indexes. For example, in the HST, the connectivity density among the top-10% motif-rich nodes is above 0.10, whereas that among the top-10% large-degree nodes is below 0.08.
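The connectivity density described above can be sketched as follows. The function name `connectivity_density` is hypothetical, and the normalization is an assumption: for a directed network without self-loops, the maximum possible number of edges among m nodes is taken to be m(m − 1); the normalization used in the paper may differ.

```python
def connectivity_density(edges, top_nodes):
    """Ratio of actual to maximum possible directed edges among top_nodes.

    Assumptions: directed edges, self-loops ignored, duplicate edges counted
    once, and a maximum of m * (m - 1) edges among m nodes.
    """
    top = set(top_nodes)
    m = len(top)
    if m < 2:
        return 0.0
    internal = {(a, b) for a, b in edges if a in top and b in top and a != b}
    return len(internal) / (m * (m - 1))
```

Evaluating this function on the top-ranked nodes of each measure, for increasing set sizes, gives one curve per measure of the kind compared in Fig. 7.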

Figure 7

Curves of connectivity density against the size of the top-ranked node set for different ranking measures in the ECT and HST.

It has been reported that many bio-molecular networks are disassortative, that is, they have negative Pearson correlation coefficients (PCCs) [6]. For example, the PCCs of the CEN and YT are −0.0537 and −0.3496, respectively. Disassortativity indicates that large-degree nodes tend to connect with low-degree nodes rather than with each other. In contrast, nodes with high I scores are involved in many network motifs, and motif-rich nodes tend to form small connected subgraphs. Thus, the I score may be helpful for finding clusters with high connectivity density in disassortative networks. Finally, we note that this paper considers only five real-world biological networks; it would be interesting to further investigate the performance of the I score in artificial networks, such as scale-free networks, small-world networks and networks with community structure. For networks with large cliques at the periphery, nodes in the cliques may have very high I score values and may therefore be highly ranked. For such networks, the identified highly ranked clusters are probably just the large cliques. We will investigate these questions in future work.

Conclusions

In this paper, based on network motifs and multivariate statistical analysis, we have proposed a novel measure to characterize node importance in directed biological networks. The new measure enables us to uncover previously undiscovered characteristics of nodes in directed biological networks. Using the new measure, we have investigated five real-world biological networks: a neural network, three transcriptional regulatory networks and one signal transduction network. These networks vary in size and link density, and contain various types of network motifs. Based on the proposed measure, we have identified important nodes in the five networks. Our investigations suggest that the most important nodes in biological networks make up only a small fraction of all nodes, but many of them have important biological functions in real-world biological systems. Moreover, ROC analysis reveals that the proposed measure is a rather stable indicator of important nodes, with very high prediction accuracy. Furthermore, the proposed measure can characterize non-hub but highly evolutionarily conserved, functionally important nodes well, and at the same time exclude hubs that are not so functionally important from the top rankings. Finally, we have discussed how the proposed measure may be used to reveal clusters with high connectivity density in disassortative networks. From these statistical analyses, we conclude that the proposed measure has some unique merits and can serve as an alternative network metric.

Although we have mainly investigated directed biological networks, the proposed measure can be extended to other networks, such as electrical networks and social networks. The proposed measure can also be extended to involve more types of network motifs, but the computational complexity increases with the number of motifs. Moreover, if the FFL is the only network motif in a directed network, the proposed method degenerates into the motif centrality [25]. Lastly, we note that this paper provides an alternative way to characterize node features; finding more effective ranking measures for nodes in directed biological networks remains an open problem, since it is generally difficult to obtain the actual rankings and a single measure is often insufficient to characterize all nodes perfectly. Research in this direction can help to identify the actual key nodes in real-world systems. Real-world implications of identifying the key nodes include finding network control and regulation targets. For example, one can explore disease-associated or essential genes in cellular networks [70]–[72] for pharmacological or re-engineering purposes.

Clusters, members, rankings and statistical characteristics of the identified top-30 ranked nodes in the DDT. (PDF)

Clusters, members, rankings and statistical characteristics of the identified top-30 ranked nodes in the HST. (PDF)

48 in total

1. Using a combination of reference tests to assess the accuracy of a new diagnostic test.

Authors: T A Alonzo; M S Pepe
Journal: Stat Med Date: 1999-11-30 Impact factor: 2.373

2. Network motifs: simple building blocks of complex networks.

Authors: R Milo; S Shen-Orr; S Itzkovitz; N Kashtan; D Chklovskii; U Alon
Journal: Science Date: 2002-10-25 Impact factor: 47.728

3. Rich-club organization of the human connectome.

Authors: Martijn P van den Heuvel; Olaf Sporns
Journal: J Neurosci Date: 2011-11-02 Impact factor: 6.167

4. Regulation of mating and filamentation genes by two distinct Ste12 complexes in Saccharomyces cerevisiae.

Authors: Song Chou; Shelley Lane; Haoping Liu
Journal: Mol Cell Biol Date: 2006-07 Impact factor: 4.272

Review 5. Evaluation of diagnostic tests when there is no gold standard. A review of methods.

Authors: A W S Rutjes; J B Reitsma; A Coomarasamy; K S Khan; P M M Bossuyt
Journal: Health Technol Assess Date: 2007-12 Impact factor: 4.014

6. Prominence and control: the weighted rich-club effect.

Authors: Tore Opsahl; Vittoria Colizza; Pietro Panzarasa; José J Ramasco
Journal: Phys Rev Lett Date: 2008-10-17 Impact factor: 9.161

7. Identification and evolution of structurally dominant nodes in protein-protein interaction networks.

Authors: Pei Wang; Xinghuo Yu; Jinhu Lü
Journal: IEEE Trans Biomed Circuits Syst Date: 2014-02 Impact factor: 3.833

8. Rich club organization and intermodule communication in the cat connectome.

Authors: Marcel A de Reus; Martijn P van den Heuvel
Journal: J Neurosci Date: 2013-08-07 Impact factor: 6.167

9. Motifs in brain networks.

Authors: Olaf Sporns; Rolf Kötter
Journal: PLoS Biol Date: 2004-10-26 Impact factor: 8.029

10. Are "ultraconserved" genetic elements really indispensable?

Authors: Liza Gross
Journal: PLoS Biol Date: 2007-09-04 Impact factor: 8.029
