BinTree seeking: a novel approach to mine both bi-sparse and cohesive modules in protein interaction networks.

Qing-Ju Jiao, Yan-Kai Zhang, Lu-Ning Li, Hong-Bin Shen.

Abstract

Modern network science has brought significant advances to our understanding of complex systems biology. As a representative model of systems biology, Protein Interaction Networks (PINs) are characterized by remarkable modular structures that reflect functional associations between their components. Many methods have been proposed to capture cohesive modules, in which the density of edges within modules is higher than that across them. Recent studies reveal that cohesively interacting modules of proteins are not a universal organizing principle in PINs, which has opened up new avenues for revisiting functional modules in PINs. In this paper, functional clusters in PINs are found to be able to form unorthodox structures defined as bi-sparse modules. In contrast to the traditional cohesive module, the nodes in a bi-sparse module are sparsely connected internally and densely connected with other bi-sparse or cohesive modules. We present a novel protocol called BinTree Seeking (BTS) for mining both bi-sparse and cohesive modules in PINs based on the Edge Density of Module (EDM) and matrix theory. BTS detects modules by depicting links and nodes rather than nodes alone, and its derivation procedure is performed entirely on the adjacency matrix of the network. The number of modules in a PIN is determined automatically by the proposed BTS approach. BTS is tested on three real PINs, and the results demonstrate that functional modules in PINs are not dominantly cohesive but can be sparse. BTS software and the supporting information are available at: www.csbio.sjtu.edu.cn/bioinf/BTS/.

Year: 2011 PMID: 22140454 PMCID: PMC3225364 DOI: 10.1371/journal.pone.0027646

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

Most biological characteristics arise from complex interactions between the cell's numerous constituents, such as proteins, DNA, RNA, and small molecules [1]–[5]. Therefore, a great challenge in systems biology is to understand the structure and dynamics of the complex intercellular networks of interactions that contribute to the structure and function of a living cell [5]. Biological functions seldom rely on individual proteins to perform particular cellular tasks; on the contrary, they generally emerge from interactions among multiple members that form highly organized modules, in which proteins often interact intimately and intensively [6]. Modules are of interest because they often correspond to functional subunits [5], such as protein complexes [6], [7] or social spheres [8]. Revealing these modular constituents in networks will bring richer biological information and new insights into the dynamics of molecular systems. As a representative example of complex biological systems, the PIN is widely used to predict protein functions [9]–[11] because its dynamic and modular structures are considered capable of providing more significant and direct evidence about how protein functions are formed. One example is automatic protein complex prediction, where protein complexes generally correspond to clusters in a PIN because proteins in a complex interact strongly with each other [12]. Considering the importance of the module information buried in a PIN, a number of mathematical and computational algorithms have been proposed to tackle module and protein complex detection in protein interaction networks [6], [13]–[18]. However, it has been revealed that cohesive modules do not completely depict the various functional units in PINs.

In 2007, Wang et al. analyzed yeast PINs, including the PIC network, which includes protein complex data, and the PEC network, which excludes all edges inferred from protein complexes, and found that the identified modules lack obvious correspondence to functional units [19]. In 2010, Pinkert et al. presented an alternative approach, different from prior definitions of what actually constitutes a “module”, to detect functional modules in PINs. They applied the method (denoted as the Pinkert method in the following sections) to the PIN from the Human Protein Reference Database (HPRD) and found some self-linking and isolated nodes that were shown to be functional modules [20]. Moreover, the authors found some significant non-diagonal modules, which were functionally related and can describe the characteristics of a protein interaction network better than cohesive modules alone. Therefore, the common notion that the cohesive module is the sole organizing structure for functional units is challenged. A Simulated Annealing (SA) based algorithm was also proposed in [20] for finding both cohesive and sparse modules. Although this method was demonstrated to be effective, it depends heavily on the parameters chosen for the SA optimization, for example the initial temperature and the cooling factor. The most difficult parameter is the number of modules in the network, which must be predefined: setting different numbers of clusters can produce totally different outputs, and this parameter is particularly hard to set properly when the network is large. Another disadvantage of optimizing the modularity E-value by SA [20] is that, for diagonal and non-diagonal modules, the over-split phenomenon cannot be avoided during the whole process (Figure 1).

Figure 1

An example of the over-split results for diagonal and non-diagonal modules.

The over-split issue in the Pinkert method means that the error function E value does not change if a big diagonal or non-diagonal module is split into two or more modules.

Unlike previous approaches that extract clusters or modules by identifying groups of proteins with similar patterns of interaction to other proteins, this paper focuses on an unorthodox module structure defined as the bi-sparse module. The members of a bi-sparse module are sparsely connected internally and densely connected with other bi-sparse or cohesive modules. Accordingly, we propose a BinTree Seeking (BTS) method based on the Edge Density of Module (EDM) and binary tree theory to mine both bi-sparse and cohesive functional modules. Unlike the existing literature, which focuses on grouping nodes [21] or optimizing modularity [16], [22]–[24] in networks, the new BTS method takes full advantage of the relationship between network edges and nodes and of binary search tree theory. Another merit of the BTS approach is that the number of modules does not need to be set beforehand; this important parameter is identified automatically in BTS based on a given evaluation criterion. By applying the BTS method to the protein Kinase and Phosphatase Interaction Network (KPIN) [25], a human protein interaction network from the I2D database [26], and a yeast interaction network from the DIP database [27], we finally obtain functional clusters composed of both cohesive and bi-sparse modules.
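The contrast between cohesive and bi-sparse modules can be illustrated with a minimal Python sketch. It is not the BTS algorithm and does not reproduce the paper's EDM definition, which is not given in this section. The helpers `build_adjacency`, `edge_density`, and `classify_module`, the 0.5 threshold, and the toy network are all assumptions introduced only for illustration.

```python
from itertools import combinations


def build_adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected network."""
    adj = [[0] * n for _ in range(n)]
    for u, v in edges:
        adj[u][v] = adj[v][u] = 1
    return adj


def edge_density(adj, rows, cols=None):
    """Fraction of possible links that are present.

    With cols=None the density is measured inside the node group `rows`;
    otherwise it is measured between the two groups `rows` and `cols`.
    (Illustrative definition, assumed here, not the paper's EDM.)
    """
    if cols is None:
        pairs = list(combinations(sorted(rows), 2))
    else:
        pairs = [(u, v) for u in rows for v in cols]
    if not pairs:
        return 0.0
    return sum(1 for u, v in pairs if adj[u][v]) / len(pairs)


def classify_module(adj, module, others, threshold=0.5):
    """Label a module using a hypothetical density threshold."""
    internal = edge_density(adj, module)
    external = max((edge_density(adj, module, o) for o in others), default=0.0)
    if internal >= threshold:
        return "cohesive"      # dense inside
    if external >= threshold:
        return "bi-sparse"     # sparse inside, dense towards another module
    return "unassigned"


# Toy network: nodes 0-2 form a clique; nodes 3-5 are not linked to each
# other but each of them is linked to every node of the clique.
clique = [(0, 1), (0, 2), (1, 2)]
cross = [(u, v) for u in range(3) for v in range(3, 6)]
adj = build_adjacency(6, clique + cross)

print(classify_module(adj, [0, 1, 2], [[3, 4, 5]]))
print(classify_module(adj, [3, 4, 5], [[0, 1, 2]]))
```

In this toy case the second group has no internal links at all, yet it is fully connected to the first group, which is the pattern the text describes as bi-sparse.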

Results

The results of applying BTS to a synthetic network

Detection of blocks is a classic issue in complex network studies, and many methods have been proposed in the literature [28]–[30]. The outputs of traditional approaches are predominantly cohesive clusters in the target network, which are considered functionally important. As a significant complement, it has been revealed recently that sparse modules can also be important functional units although the links among their members are very sparse [20]. A synthetic benchmark network of 128 nodes was constructed, consisting of four modules, two of which are cohesive clusters while the other two form bi-partite structures. To demonstrate the robustness of the proposed BTS method, 5 noisy complex networks with noise levels of 0.1 to 0.5 were constructed by adding noise to the original benchmark data (Figure 2(A)–(E)), where noise was added in the same way as described for the Pinkert method [20]. The proposed BTS and the Pinkert method (with the number of classes set to 4) were both employed to mine the clusters in these 5 noisy networks. Figure 2(F) compares the E-values obtained by the two methods. BTS obtains smaller E-values on the 3 tested networks with noise levels of 0.1, 0.3, and 0.4, while the Pinkert method performs better on the other two networks. Our experiments also show that the E-values of the Pinkert method can change dramatically when the number of classes is set to other values.
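A benchmark of this kind can be sketched in a few lines of Python. The sketch below rests on several assumptions that the text does not confirm: four equally sized modules of 32 nodes, modules 0 and 1 as fully connected cohesive clusters, modules 2 and 3 forming one complete bi-partite structure with each other, and noise modelled by flipping each node pair independently with probability equal to the noise level. The actual noise procedure is the one described in [20] and may differ. The function name `synthetic_network` is hypothetical.

```python
import random


def synthetic_network(n_per_module=32, noise=0.1, seed=0):
    """Return (adjacency matrix, module labels) for a 4-module benchmark.

    Assumed layout: modules 0 and 1 are cohesive (all internal pairs linked);
    modules 2 and 3 have no internal links but are fully linked to each other.
    Assumed noise model: every node pair is flipped with probability `noise`.
    """
    rng = random.Random(seed)
    n = 4 * n_per_module
    label = [i // n_per_module for i in range(n)]
    adj = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            a, b = label[u], label[v]
            linked = (a == b and a in (0, 1)) or {a, b} == {2, 3}
            if rng.random() < noise:
                linked = not linked  # add or remove the link as noise
            adj[u][v] = adj[v][u] = int(linked)
    return adj, label


# One network per noise level, mirroring the five levels used in the text.
networks = {level: synthetic_network(noise=level) for level in (0.1, 0.2, 0.3, 0.4, 0.5)}
clean_adj, clean_label = synthetic_network(noise=0.0)
print(len(clean_adj), sum(map(sum, clean_adj)) // 2)
```

Under this noise model a level of 0.5 leaves no trace of the planted structure, which is why it is a natural upper bound for a robustness test.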

Figure 2

The E values for the two methods on 5 complex networks.

Synthetic networks composed of 2 cohesive clusters and 2 bi-partite structures: (A) with 10% noise; (B) with 20% noise; (C) with 30% noise; (D) with 40% noise; (E) with 50% noise; and (F) E-value comparison results between Pinkert method and proposed BTS.

The results of applying BTS to real PINs data

Using the proposed BTS method, we analyzed the KPI protein interaction network, the DIP yeast protein interaction network, and the BIND human protein interaction network, and obtained 29, 59, and 65 modules respectively on these three PINs (see Figure S1 for details). Different module quality criteria are available in the literature, for example the concept of structural equivalence [31], the Newman modularity Q [23], [32], and the E value that describes the connection structure of the original network [20], [33] (see Appendix S2 for definitions of the Q and E functions). The former two are special cases of the E-value used by the Pinkert approach. Therefore, we mainly focus on comparing the final E values computed by the proposed BTS method and by the others. In the Pinkert method, the E-values depend significantly on the predefined number of modules or clusters q, and it is still not clear how to select q, which is usually identified by trying different choices. Therefore, in this study we use the same strategy as in [20] and test different selections of q, i.e., q = 5 to q = 25, 50, and 100. Figure 3 illustrates the relationship between E and q on the three PINs studied in this paper, and shows that the E values tend to decrease as q increases. The typical values q = 5, 25, 50, and 100 were selected, and their corresponding E values were compared with those of the BTS method. In addition, q = 29, 65, and 59 were also set in the Pinkert method on the KPI PIN, the BIND human PIN, and the DIP yeast core PIN respectively, because these q values equal the numbers of modules output by the BTS method. Figure 4 shows the comparison of E values. As can be seen from Figure 4, the BTS method yields smaller E values than the Pinkert method on the BIND human PIN and the DIP yeast core PIN (apart from q = 100), which is better according to the definition of E. On the KPI PIN, the E-values of BTS are larger than those generated by the Pinkert method for most choices of q. This could be due to the existence of some large bi-sparse and cohesive functionally related modules, as will be shown in the following sections.

Figure 3

The relationship between E and q on three PINs in the Pinkert method.

Figure 4

E values comparison results between BTS and Pinkert methods on 3 different PINs.

29, 65, and 59 modules were identified by BTS on the three PINs respectively, and results for 5 different numbers of clusters q of the Pinkert method are reported for each PIN.

To evaluate the functional meaningfulness of the modules obtained by the BTS method, the Newman-fast method, and the Pinkert method, we performed Gene Ontology (GO) [34] enrichment analysis for all modules using the BiNGO tool [35], which is incorporated into the Cytoscape platform [36]. Based on the BiNGO tool, the number of modules with no significant annotations and the p-values (biological process, BP) of all modules are compared. The cumulative distribution frequency of all modules detected by the three approaches is used to summarize the p-value results (see Figure S1 for the detailed cumulative distribution frequencies and p-values). The performance comparisons are presented in Figures 5, 6, and 7, where a larger area under the corresponding curve is generally considered better. As can be seen, the areas obtained by the BTS method and the Newman method are nearly equal in Figure 5. In Figures 6 and 7, the BTS method achieves the largest area among the three methods, apart from some cases such as the results generated by the Pinkert method with q = 5.
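The area-under-curve comparison can be made concrete with a small Python sketch. The exact axes used in Figures 5 to 7 are not stated in this section, so the following choices are assumptions: thresholds are taken on a log10 scale, the curve gives the fraction of modules whose best p-value is at or below each threshold, modules without significant annotation are assigned p = 1, and the area is computed with the trapezoidal rule. The function names and the p-values are invented for illustration and are not results from the paper.

```python
def cumulative_frequency(p_values, exponents):
    """For each exponent e, the fraction of modules with p-value <= 10**e."""
    n = len(p_values)
    return [sum(1 for p in p_values if p <= 10.0 ** e) / n for e in exponents]


def area_under_curve(xs, ys):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))


# Hypothetical per-module p-values for two methods (illustrative only).
method_a = [3e-12, 2e-8, 5e-3, 1.0]
method_b = [4e-4, 3e-2, 1.0, 1.0]

exponents = list(range(-15, 1))  # thresholds 1e-15 ... 1e0
freq_a = cumulative_frequency(method_a, exponents)
freq_b = cumulative_frequency(method_b, exponents)
area_a = area_under_curve(exponents, freq_a)
area_b = area_under_curve(exponents, freq_b)

print("method A area:", area_a)
print("method B area:", area_b)
```

A method whose modules reach small p-values more often has a curve that rises earlier, so its area is larger. This matches the reading in the text that a larger area indicates more functionally enriched modules.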