Guangming Liu1, Bianfang Chai2, Kuo Yang1, Jian Yu1, Xuezhong Zhou3. 1. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, No. 3 Shangyuancun Haidian District, Beijing, People's Republic of China. 2. Department of Information Engineering, Hebei GEO University, Shijiazhuang, People's Republic of China. 3. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, No. 3 Shangyuancun Haidian District, Beijing, People's Republic of China. xzzhou@bjtu.edu.cn.
Abstract
A large amount of protein-protein interaction (PPI) data has been generated by high-throughput experimental techniques. Uncovering functional modules from PPI networks helps us better understand the underlying mechanisms of cellular functions. Numerous computational algorithms have been designed to identify functional modules automatically in the past decades. However, most community detection methods (whether non-overlapping or overlapping) are unsupervised models, which cannot incorporate well-known protein complexes as prior knowledge. The authors propose a novel semi-supervised model named pairwise constrained non-negative matrix tri-factorisation (PCNMTF), which takes full advantage of well-known protein complexes to find overlapping functional modules in PPI networks by using the protein module indicator matrix and the module correlation matrix simultaneously. PCNMTF deterministically models and learns the mixed module memberships of each protein while taking the correlation among modules into account, based on non-negative matrix tri-factorisation. Experimental results on both synthetic and real-world biological networks demonstrate that PCNMTF identifies functional modules more precisely than state-of-the-art methods.
1 Introduction
Proteins seldom exert their biological functions as independent entities; they usually act as organised groups or functional modules [1]. With the development of high-throughput experimental technologies, such as mass spectrometry [2, 3] and two-hybrid systems [4, 5], large amounts of protein-protein interaction (PPI) data have become available, which makes it possible to reveal the fundamental patterns of cellular systems. Generally, these PPI data sets are expressed as undirected networks in which proteins are the vertices and interactions between pairs of proteins are the links [6]. Protein networks have distinctive topological properties, including (i) the small-world property [7], (ii) a scale-free degree distribution [8] and (iii) functional modular organisation [9]. Detecting functional modules in PPI networks is therefore a way to discover the underlying mechanisms of cellular functions.

Proteins that interact with each other are more likely to take part in the same or similar biological functions than those that do not [10]. Hence, densely connected regions in PPI networks can be regarded as functional modules. Many computational approaches have been proposed to identify densely linked sub-graphs automatically as functional modules (or protein complexes) [11, 12]. In terms of the detected modules, functional module detection methods can be divided into two categories: non-overlapping and overlapping algorithms.

An entropy-based functional module detection method was proposed by Kenley [13], in which a protein is selected randomly as a seed and absorbs its neighbours to form an initial module; proteins adjacent to this module are then added or removed according to the resulting increase or decrease in entropy. UVCluster [14], proposed by Arnau et al., is a hierarchical clustering method based on the shortest path between pairs of proteins.

In recent years, many overlapping module detection methods have been proposed [12, 15-17]. Xiang et al. [17] proposed a weighted gene co-expression network analysis algorithm to identify overlapping modules related to glioblastoma multiforme prognosis. Bader and Hogue proposed a functional module detection algorithm named MCODE [18], which identifies functional modules by fully exploiting the degree of proteins. Another well-known overlapping functional module detection algorithm, CFinder, was developed by Adamcsek et al. [15]; it first uncovers k-cliques by clique percolation [16] and then merges adjacent k-cliques into functional modules. Nepusz et al. [12] proposed an overlapping protein module detection method called ClusterONE, in which the proteins with the highest degree are first selected as seeds, and their neighbouring nodes are then added to or removed from the growing module according to a cohesiveness score.

Some algorithms can detect both non-overlapping and overlapping modules, such as non-negative matrix factorisation (NMF)-based methods. NMF is a widely used matrix decomposition approach that factorises a non-negative matrix into two low-rank non-negative matrices, and it has been successfully applied in text, image and natural language analysis [19] and in functional module detection [20]. Nevertheless, the physical meaning of the two factorised matrices is ambiguous. Non-negative matrix tri-factorisation (NMTF) addresses this by assigning a clear physical meaning to each factorised matrix; we will introduce it in Section 2.2. Wang et al. [21] used NMTF to co-cluster multi-type relational data simultaneously, Zhu et al. [22] used NMTF to analyse both user-level and tweet-level sentiments on social media and Pei et al. [23] used NMTF to detect community structure in social networks. All three are unsupervised methods, and their performance depends on the choice of the similarity function used in the manifold regularisation term.

The above-mentioned methods consider only topological information; however, PPI data acquired from high-throughput biological experiments are incomplete [24], and these sparse PPI networks contain many noisy and erroneous interactions. For instance, the percentage of false-positive interactions is occasionally as high as 50% [25]. Therefore, protein module detection methods based purely on topological structure may not obtain accurate functional modules. Fortunately, some manually curated, high-quality protein complex databases, such as CORUM [26], are available. Compared with the PPI network, the number of proteins in protein complexes is small, but these complexes can be used as prior information to help address the limitations of PPI data for functional module detection.

To address these limitations of PPI networks, we propose a novel semi-supervised model named pairwise constrained non-negative matrix tri-factorisation (PCNMTF), which uses known high-quality protein complexes as prior information to identify functional modules more precisely than unsupervised methods. We expect to uncover new functional modules from PPI networks using this prior information: some of the detected modules are contained in the complex database and some are not. We first extract must-link constraints from protein complexes, where a pair of proteins within the same complex indicates a must-link constraint. These limited constraints are then used to guide the factorisation iterations.

The main contributions of this work are as follows: (i) we present a novel semi-supervised functional module detection model, PCNMTF, which makes full use of known protein complexes as prior information to help detect functional modules; (ii) a Frobenius-norm constraint is imposed on the community relationship matrix to make the solution stable; (iii) unlike existing NMF and NMTF methods, the module membership of a protein is decided not only from the indicator matrix but also from the module relationship matrix.
2 Related work
Models based on NMF [27] and NMTF [28] have been successfully used in community detection in recent years. They fall roughly into two kinds: unsupervised and supervised (or semi-supervised) methods. Given a similarity matrix $X$ of a network, the module memberships of nodes are derived from it. In this section, we first introduce several classic methods for calculating the similarity matrix and then introduce the unsupervised and semi-supervised NMF models for module detection.
2.1 Similarity matrix of a network
Extracting a similarity matrix $X$ of the nodes from the topological information is a fundamental task. There are three methods to construct the similarity matrix $X$:

(i) Adjacency matrix. The adjacency matrix $A$ is used directly as the similarity matrix $X$, or $X$ is constructed from $A$ with a parameter that controls the contribution of the additional term [29].

(ii) Shortest path. The similarity between node i and node j is computed from the length of the shortest path between them, where k is a constant [30].

(iii) Diffusion kernel feature matrix. First, an opposite Laplacian matrix $\bar{L}$ is constructed from the network as follows:

$$\bar{L}_{ij} = \begin{cases} 1, & \text{if nodes } i \text{ and } j \text{ are linked} \\ -d_i, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases}$$

where $d_i$ is the degree of vertex i. Then the exponential of matrix $\bar{L}$ is defined as $K = \exp(\gamma \bar{L})$, where $\gamma$ is a positive parameter that controls the extent of diffusion. Finally, the similarity matrix $X$ is obtained from $K$ [31].
2.2 Unsupervised NMF
The unsupervised methods use only the topological structure of the network to detect modules. The similarity matrix $X \in \mathbb{R}_+^{n \times n}$ is taken as the input, and NMF factorises $X$ into two non-negative low-rank matrices $W \in \mathbb{R}_+^{n \times k}$ and $H \in \mathbb{R}_+^{n \times k}$, where $k \ll n$. We use the Euclidean distance to quantify the quality of the approximation given by the product of $W$ and $H^{T}$. The objective function is defined as $\min_{W \ge 0,\, H \ge 0} \|X - WH^{T}\|_F^2$. Meanwhile, a symmetric NMF (SNMF) has been proposed to identify community structure, since $X$ is a symmetric similarity matrix; its objective function is defined as $\min_{H \ge 0} \|X - HH^{T}\|_F^2$.

Since neither NMF nor SNMF considers the relationships between modules, NMTF is designed to uncover the underlying modules of a network. It is formulated as $\min_{H \ge 0,\, S \ge 0} \|X - HSH^{T}\|_F^2$, where $H$ is the node membership indicator matrix and $S$ represents the relationship between modules. NMTF can express any connection between two nodes in a network through the term $HSH^{T}$, whereas NMF and SNMF cannot.
2.3 Semi‐supervised NMF
In real-world applications, prior information is often readily available in pairwise form and can be used to improve the performance of community detection algorithms. In recent years, many algorithms have been proposed to incorporate such prior information into module detection. Zhang et al. [32] designed a model that uses must-link constraints to enhance the adjacency matrix, so that a new adjacency matrix is obtained by modifying the original one according to these constraints. Based on the new adjacency matrix, NMF, SNMF and NMTF are able to identify modules from PPI networks.

Yang et al. [33] proposed a semi-supervised module detection framework that combines NMF and SNMF with pairwise constraints to uncover communities. The objective functions are $\min_{W \ge 0,\, H \ge 0} \|X - WH^{T}\|_F^2 + \lambda\,\mathrm{tr}(H^{T}LH)$ and $\min_{H \ge 0} \|X - HH^{T}\|_F^2 + \lambda\,\mathrm{tr}(H^{T}LH)$, where $X$ is the similarity matrix, $W$ and $H$ are the non-negative factor matrices, $L$ is the Laplacian matrix of the pairwise constraints and $\lambda$ is a positive parameter that balances the trade-off between topological structure and must-link information.
3 Functional module detection based on PCNMTF
Notations: A PPI network is typically represented as an undirected graph $P = (V, E)$, in which $V$ is the vertex set representing the proteins and $E$ is the edge set representing the interactions between protein pairs. Let an $n \times n$ non-negative symmetric matrix $A$ denote the adjacency matrix of graph $P$, where the element $A_{ij}$ indicates whether an interaction exists between the $i$th and $j$th proteins. For convenience, we set $A_{ij} = 1$ if and only if protein $i$ interacts with protein $j$, and $A_{ij} = 0$ otherwise.
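As a small illustration of this notation, the sketch below builds a symmetric 0/1 adjacency matrix from a list of interacting protein pairs. The function name `build_adjacency`, the toy protein identifiers and the choice to skip self-interactions are assumptions made for the example, not details taken from the paper.

```python
def build_adjacency(proteins, interactions):
    """Return (index, A): a protein-to-row map and a symmetric 0/1 matrix."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    A = [[0] * n for _ in range(n)]
    for u, v in interactions:
        if u == v:
            continue  # assumption: self-interactions are ignored
        i, j = index[u], index[v]
        A[i][j] = 1  # the graph is undirected, so both entries are set
        A[j][i] = 1
    return index, A


# Toy network: three mutually interacting proteins and one isolated protein.
proteins = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P1")]
index, A = build_adjacency(proteins, interactions)
```

Each row of the resulting matrix is the high-dimensional, sparse description of one protein that the factorisation is meant to compress.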
3.1 Problem statement
Given the adjacency matrix $A$ of a PPI network and a known protein complex database, we extract pairwise information, namely must-link constraints, from the complex database and build a must-link matrix $M$ from these constraints. The goal of the proposed semi-supervised module detection model is to find the protein module membership matrix $H$ and the module relationship matrix $S$ from the given adjacency matrix $A$ and must-link matrix $M$. We seek an objective function based on matrix factorisation that can identify the underlying module structures of PPI networks; the objective function to be minimised is defined as follows:

$$\min_{H \ge 0,\, S \ge 0} \; \|A - HSH^{T}\|_F^2 + \lambda\,\mathrm{tr}(H^{T}LH) + \beta\|S\|_F^2$$

The first term measures the deviation between the product of $H$ and $S$, $HSH^{T}$, and the adjacency matrix $A$; the second term is the penalty for must-link constraints, where $L$ is the Laplacian matrix of $M$; and the last term is a regularisation term on $S$. The weights $\lambda$ and $\beta$ are positive parameters.
3.2 Matrix tri‐factorisation
Known interactions between protein pairs in human PPI networks are still rare [24, 34], so the corresponding graph P is considerably sparse. If each row of the adjacency matrix $A$ is used directly as the feature vector of a protein, computation is expensive because the dimensionality equals the number of proteins in the whole PPI network. Furthermore, module detection based on this feature vector performs unsatisfactorily [35]. The NMF [30, 36] and SNMF [37] models have been proposed because they can learn a high-quality lower-dimensional representation for each protein in the PPI network. Moreover, previous studies have confirmed that NMF models offer clear advantages in detecting modules within biological networks [38]. However, these models do not consider the correlation between modules, that is, the interactions between modules (two overlapping modules share more interactions than two non-overlapping ones), when assigning module memberships to a protein, which may lead to an inaccurate module division.

To overcome this drawback of NMF, NMTF is employed in this paper, and the objective function is defined as follows:

$$\min_{H \ge 0,\, S \ge 0} \; \|A - HSH^{T}\|_F^2$$

where $H$ is an $n \times k$ matrix representing the module membership of proteins (k is the maximum possible number of modules), the element $H_{ij}$ represents the probability that node i belongs to module j, and $S$ is a $k \times k$ symmetric matrix denoting the correlations between module pairs. The product $HSH^{T}$ indicates the relationship between any two proteins in accordance with the module structure, and $\|\cdot\|_F$ denotes the Frobenius norm. Since the adjacency matrix $A$ is non-negative, non-negativity constraints are imposed on both $H$ and $S$.
3.3 Pairwise constrained
A protein complex is a group of proteins that interact densely with each other and tend to share similar biological functions [39]. Intuitively, proteins within the same complex should be clustered into the same module, and must-link constraints are generated from these proteins. The must-link matrix $M \in \mathbb{R}^{n \times n}$ is constructed from the extracted must-link constraints, where $M_{ij} = 1$ if protein i and protein j co-occur in a common protein complex and $M_{ij} = 0$ otherwise. The module memberships of any protein pair i and j with a must-link constraint should be as similar as possible, which means that the difference between the ith and jth rows of the module indicator matrix $H$ should be as small as possible. In this paper, the squared distance between two vectors is used to measure their similarity, denoted as $\|h_i - h_j\|^2$, where $h_i$ is the ith row of $H$.

The must-link constraints used as prior information can be formulated as follows:

$$\frac{1}{2}\sum_{i,j} M_{ij}\,\|h_i - h_j\|^2 = \mathrm{tr}(H^{T}DH) - \mathrm{tr}(H^{T}MH) = \mathrm{tr}(H^{T}LH)$$

where $D$ is the diagonal degree matrix of $M$ ($D_{ii} = \sum_j M_{ij}$), $L = D - M$ is the Laplacian matrix of $M$, and $\mathrm{tr}(\cdot)$ indicates the trace of a matrix.
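The equivalence between the pairwise squared-distance penalty and the Laplacian trace form can be checked numerically. The sketch below is an illustration under stated assumptions: the symbols `M` (must-link matrix), `L` (its Laplacian, degree matrix minus `M`) and `H` (membership matrix), the factor 1/2 in the pairwise sum, and all function names are choices made for this example.

```python
def must_link_matrix(n, complexes):
    """M[i][j] = 1 when proteins i and j share a complex, 0 otherwise."""
    M = [[0] * n for _ in range(n)]
    for members in complexes:
        for i in members:
            for j in members:
                if i != j:
                    M[i][j] = 1
    return M


def laplacian(M):
    """L = D - M, where D is diagonal with D[i][i] = sum_j M[i][j]."""
    n = len(M)
    return [[(sum(M[i]) if i == j else 0) - M[i][j] for j in range(n)]
            for i in range(n)]


def pairwise_penalty(M, H):
    """0.5 * sum_ij M_ij * ||h_i - h_j||^2 over rows h_i of H."""
    n = len(M)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if M[i][j]:
                total += M[i][j] * sum((a - b) ** 2 for a, b in zip(H[i], H[j]))
    return 0.5 * total


def trace_penalty(L, H):
    """tr(H^T L H), computed as sum_ij L_ij * <h_i, h_j>."""
    n = len(L)
    return sum(L[i][j] * sum(a * b for a, b in zip(H[i], H[j]))
               for i in range(n) for j in range(n))


# One toy complex {0, 1, 2} among four proteins, and a toy membership matrix.
M = must_link_matrix(4, [[0, 1, 2]])
L = laplacian(M)
H = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]
```

Both `pairwise_penalty(M, H)` and `trace_penalty(L, H)` evaluate the same quantity; the trace form is the one that is convenient to differentiate.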
3.4 PCNMTF
There are many ways to use topological information and pairwise constraints simultaneously for protein module detection. The main idea in this work is to use the pairwise constraints as a penalty term rather than simply to incorporate the prior information into the original PPI network, so that the objective function is penalised if the must-link constraints are not satisfied. The objective function of the proposed model PCNMTF is defined as follows:

$$\min_{H \ge 0,\, S \ge 0} \; J = \|A - HSH^{T}\|_F^2 + \lambda\,\mathrm{tr}(H^{T}LH) + \beta\|S\|_F^2 \tag{6}$$

where $\lambda$ is a parameter that balances the trade-off between the prior knowledge formulated as must-link constraints and the topological structure of the PPI network. Furthermore, the Frobenius norm is imposed on matrix $S$ as a regularisation term to generate stable solutions for (6) and to prevent overfitting; $\beta$ is a smoothing parameter.

Although the proposed model PCNMTF is similar to the previous studies of Wang et al. [28], Zhang et al. [32] and Yang et al. [33], it differs from them in several aspects. Wang's method relies only on topological information and does not consider prior information, so it has difficulty detecting modules accurately in networks with no clear modular structure. Zhang et al. used must-link constraints directly to modify the adjacency matrix; however, the enhanced adjacency matrix does not guarantee that a node pair with a must-link constraint is clustered into the same module. Yang et al. proposed a semi-supervised framework based on NMF to uncover modules; however, the physical meaning of the two factorised matrices is not clear and the relationship between modules is not learned. The proposed model PCNMTF uses prior information to guide the learning of the protein membership matrix and the module relationship matrix simultaneously. Furthermore, we propose a novel overlapping module detection method that considers these two matrices at the same time.

Using the trace identities $\|X\|_F^2 = \mathrm{tr}(X^{T}X)$, $\mathrm{tr}(XY) = \mathrm{tr}(YX)$ and $\mathrm{tr}(X) = \mathrm{tr}(X^{T})$, (6) is rewritten as follows:

$$J = \mathrm{tr}(A^{T}A) - 2\,\mathrm{tr}(A^{T}HSH^{T}) + \mathrm{tr}(HSH^{T}HSH^{T}) + \lambda\,\mathrm{tr}(H^{T}LH) + \beta\,\mathrm{tr}(S^{T}S) \tag{7}$$

In order to satisfy the non-negativity constraints $H \ge 0$ and $S \ge 0$, we introduce two Lagrange multipliers $\Phi$ and $\Psi$, respectively; the Lagrange function of (7) is then written as follows:

$$\mathcal{L} = J + \mathrm{tr}(\Phi H^{T}) + \mathrm{tr}(\Psi S^{T}) \tag{8}$$

Since (8) is non-convex in the two matrices $H$ and $S$ taken as variables simultaneously, in order to minimise J we first take the partial derivatives with respect to $H$ and $S$, respectively (using the symmetry of $A$, $S$ and $L$):

$$\frac{\partial \mathcal{L}}{\partial H} = -4AHS + 4HSH^{T}HS + 2\lambda LH + \Phi, \qquad \frac{\partial \mathcal{L}}{\partial S} = -2H^{T}AH + 2H^{T}HSH^{T}H + 2\beta S + \Psi \tag{9}$$

Setting (9) equal to zero and using the KKT conditions $\Phi_{ij}H_{ij} = 0$ and $\Psi_{ij}S_{ij} = 0$, together with $L = D - M$, the updating rules of the protein indicator matrix $H$ and the module relationship matrix $S$ are given as follows:

$$H \leftarrow H \circ \frac{2AHS + \lambda MH}{2HSH^{T}HS + \lambda DH}, \qquad S \leftarrow S \circ \frac{H^{T}AH}{H^{T}HSH^{T}H + \beta S}$$

where $\circ$ denotes element-wise multiplication between two matrices and the fractions are taken element-wise. In order to minimise (6), the updating strategy is to update one matrix while keeping the other fixed, iteratively. The iterative process terminates when the objective function has converged or the number of iterations exceeds a given threshold. The proposed PCNMTF model is laid out in Algorithm 1 (see Fig. 1).
Fig. 1
Algorithm 1: The proposed PCNMTF
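The alternating scheme can be sketched in a few lines. The code below is a minimal illustration, not the authors' Algorithm 1: it assumes multiplicative updates obtained from the KKT conditions of an objective of the form fit error + lam * must-link penalty + beta * Frobenius norm of the relationship matrix. All names (`pcnmtf`, `lam`, `beta`, `eps`, `init`) are hypothetical. Plain Python lists keep the sketch dependency-free; a practical implementation would use a numerical library and a convergence test on the objective.

```python
import random


def matmul(X, Y):
    """Plain-Python matrix product of two list-of-lists matrices."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def pcnmtf(A, M, k, lam=1.0, beta=1.0, iters=200, eps=1e-9, seed=0, init=None):
    """Alternate multiplicative updates of H (n x k) and S (k x k).

    A: symmetric non-negative adjacency matrix, M: 0/1 must-link matrix.
    eps keeps denominators positive (an implementation assumption).
    """
    n = len(A)
    rng = random.Random(seed)
    if init is None:
        H = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
        S = [[1.0 if a == b else 0.1 for b in range(k)] for a in range(k)]
    else:
        H = [row[:] for row in init[0]]
        S = [row[:] for row in init[1]]
    deg = [sum(row) for row in M]  # diagonal of the degree matrix D of M
    for _ in range(iters):
        # Update H with S fixed: H <- H * (2AHS + lam*MH) / (2HSH'HS + lam*DH)
        HS = matmul(H, S)
        AHS = matmul(A, HS)
        MH = matmul(M, H)
        HSHtHS = matmul(HS, matmul(transpose(H), HS))
        H = [[H[i][j] * (2 * AHS[i][j] + lam * MH[i][j])
              / (2 * HSHtHS[i][j] + lam * deg[i] * H[i][j] + eps)
              for j in range(k)] for i in range(n)]
        # Update S with H fixed: S <- S * (H'AH) / (H'HSH'H + beta*S)
        Ht = transpose(H)
        HtH = matmul(Ht, H)
        HtAH = matmul(Ht, matmul(A, H))
        den = matmul(matmul(HtH, S), HtH)
        S = [[S[a][b] * HtAH[a][b] / (den[a][b] + beta * S[a][b] + eps)
              for b in range(k)] for a in range(k)]
    return H, S
```

Because every factor in the updates is non-negative, the iterates stay non-negative without any explicit projection, which is the usual reason for preferring multiplicative rules in this family of models.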
3.5 Overlapping module detection
We developed a novel overlapping module detection method that uses both the protein module membership matrix $H$ and the module relationship matrix $S$. Each row of $H$ has dimension k, the number of all possible modules, and the element $H_{ij}$ denotes the strength with which protein i belongs to module j [37]. Intuitively, if protein i belongs to multiple modules, these modules must be related to some extent. For protein i, we therefore first assign it to the module c to which it most likely belongs, that is, $c = \arg\max_j H_{ij}$. In addition to module c, we also consider assigning protein i to another module j if two conditions are satisfied at the same time; the conditions involve the membership strength of protein i in module j and $S_{jc}$, the element of $S$ that denotes the relationship between module j and module c. As a consequence, each protein in the PPI network can be clustered into one or more modules effectively and efficiently. In this manuscript, the threshold is set to 0.2 empirically, in a similar way to Zhang's work [40].
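The two-stage assignment can be sketched as follows. The exact form of the two conditions is not reproduced in this text, so the sketch makes a loud assumption: a protein joins a secondary module j when both its membership strength in j and the relationship between j and its primary module c reach the same threshold `tau`. The function name `assign_modules` and the toy matrices are likewise hypothetical.

```python
def assign_modules(H, S, tau=0.2):
    """Return {protein index: sorted list of module indices}.

    H: n x k membership strengths, S: k x k module relationship matrix.
    ASSUMPTION: the secondary-module test is H[i][j] >= tau and
    S[j][c] >= tau; the source only states that two conditions and a
    threshold of 0.2 are used.
    """
    assignment = {}
    for i, row in enumerate(H):
        # Primary module: the column with the largest membership strength.
        c = max(range(len(row)), key=row.__getitem__)
        chosen = [c]
        for j, strength in enumerate(row):
            if j != c and strength >= tau and S[j][c] >= tau:
                chosen.append(j)
        assignment[i] = sorted(chosen)
    return assignment


# Toy example with three proteins and three candidate modules.
H = [[0.7, 0.3, 0.0],
     [0.1, 0.9, 0.0],
     [0.5, 0.1, 0.4]]
S = [[1.0, 0.5, 0.05],
     [0.5, 1.0, 0.3],
     [0.05, 0.3, 1.0]]
modules = assign_modules(H, S)
```

In the toy data, protein 2 has a fairly strong membership in module 2, but modules 2 and 0 are almost unrelated, so under the assumed rule it stays in its primary module only.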
4 Experimental results
4.1 Data sets
We use two common synthetic benchmarks to verify the effectiveness of the proposed model PCNMTF. Girvan and Newman [41] designed a synthetic network benchmark generator in which each network (denoted as a GN network) contains 128 nodes divided into four modules. The average degree of each node is 16. For each node, let $z_{in}$ be the number of edges that randomly link it to nodes in its own module and $z_{out}$ the number of edges that randomly link it to nodes in other modules, so that $z_{in} + z_{out} = 16$. As $z_{out}$ increases, the modular structure becomes less clear. Previous studies have shown that for large $z_{out}$ the modular structure of the generated networks becomes vague and most state-of-the-art methods have difficulty identifying the modules accurately. In this work, we set $z_{out} = 8$ and randomly generated 100 benchmark networks (denoted as GN8). The average benchmark modularity of these GN8 networks is 0.27.

Lancichinetti et al. [42] developed another widely used artificial network benchmark generator (denoted as LFR), which provides several parameters to control the properties of the generated networks, such as the number of nodes (n), the average degree of each node (ad), the maximum degree of each node (md), the minimum module size, the maximum module size and a mixing parameter (mp), which represents the fraction of edges between modules. Similar to $z_{out}$ in GN networks, a larger mp leads to a less clear modular structure. In our experiment, we set n = 1000, ad = 15, md = 50 and mp = 0.7, with the minimum and maximum module sizes fixed, and then randomly generated 100 benchmark networks (denoted as LFR); the average benchmark modularity of these LFR networks is 0.26.

Two human PPI networks are used in our work: one is derived from the human subset of the Database of Interacting Proteins (DIP) [43] and the other from the Human Protein Reference Database (HPRD) [44]. Two protein complex databases are used. The first is CORUM [26], which covers mammalian protein complexes; complexes that do not exist in human are filtered out in this study. The second is PCDq [45], which covers human protein complexes. Protein complexes with fewer than three proteins are filtered out in our experiments. The coverage of the two human PPI networks by these two complex databases and the properties of the PPI networks are listed in Table 1, where #p and #e denote the number of proteins and edges in the PPI network, respectively; #cc and #cp denote the number of complexes and proteins of the complex database covered by the PPI network, respectively; and #as, #ai and #ad denote the average size, average number of interactions and average degree of the complexes, respectively.
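For readers who want to reproduce a GN-style benchmark, the sketch below generates one network with 128 nodes in four equal modules and an expected degree of 16. It assumes the common construction in which each possible edge is drawn independently, with probabilities chosen so that a node has on average 16 minus `z_out` links inside its module and `z_out` links outside. The function name `gn_network` and its parameters are hypothetical, and the original generator may differ in detail.

```python
import random


def gn_network(z_out, n=128, groups=4, degree=16, seed=None):
    """Return (labels, edges) for one GN-style benchmark network.

    Each node has on average (degree - z_out) intra-module links and
    z_out inter-module links; edges are drawn independently (assumption).
    """
    rng = random.Random(seed)
    size = n // groups
    z_in = degree - z_out
    p_in = z_in / (size - 1)    # a node has size - 1 possible intra-module partners
    p_out = z_out / (n - size)  # and n - size possible inter-module partners
    labels = [i // size for i in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            p = p_in if labels[i] == labels[j] else p_out
            if rng.random() < p:
                edges.add((i, j))
    return labels, edges


# One network with half of each node's links pointing outside its module.
labels, edges = gn_network(z_out=8, seed=1)
```

Raising `z_out` towards 16 shifts edges from inside the modules to between them, which is what blurs the modular structure.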
Table 1
Properties of human networks and complexes

| Network | #p | #e | CORUM #cc | CORUM #cp | CORUM #as | CORUM #ai | CORUM #ad | PCDq #cc | PCDq #cp | PCDq #as | PCDq #ai | PCDq #ad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DIP | 2943 | 4673 | 746 | 1018 | 5.51 | 3.43 | 6.85 | 340 | 1090 | 4.58 | 2.22 | 5.83 |
| HPRD | 9453 | 36,888 | 1069 | 1823 | 5.76 | 5.47 | 30.49 | 874 | 2892 | 4.39 | 3.96 | 23.72 |
4.2 Evaluation metrics
Since each node in the two artificial networks mentioned above has a known community membership, the normalised mutual information (NMI) [46] and accuracy are employed to measure the quality of the detected modules. Accuracy is the percentage of nodes whose module membership is correctly identified by the community detection method. Let $g_i$ and $d_i$ denote the ground-truth label and the detected label of node i; the accuracy is defined as

$$\mathrm{AC} = \frac{\sum_{i=1}^{n} \delta\big(g_i, \mathrm{map}(d_i)\big)}{n}$$

where $\delta(x, y) = 1$ if x = y and $\delta(x, y) = 0$ otherwise, and $\mathrm{map}(\cdot)$ is a function that maps each detected label to the equivalent ground-truth label, implemented with the Kuhn-Munkres algorithm [47]. The NMI metric measures the similarity between the ground-truth module set and the detected module set and is defined as follows:

$$\mathrm{NMI} = \frac{-2\sum_{g}\sum_{d} n_{gd}\,\log\dfrac{n\,n_{gd}}{n_g\,n_d}}{\sum_{g} n_g \log\dfrac{n_g}{n} + \sum_{d} n_d \log\dfrac{n_d}{n}}$$

where $n_g$ denotes the number of proteins in the gth ground-truth module $C_g$, $n_d$ is the number of proteins in the dth detected module $C'_d$, n is the total number of proteins in the PPI network and $n_{gd}$ is the number of proteins shared by modules $C_g$ and $C'_d$.

For the human PPI networks, precision, recall and F-measure are used to assess the quality of the detected modules. The extent of overlap between a gold-standard complex and a detected module is measured as follows:

$$\mathrm{OS}(p, d) = \frac{|p \cap d|^2}{|p|\,|d|}$$

where $|p|$ is the size of a known protein complex p, $|d|$ is the size of a detected protein module d and $|p \cap d|$ is the number of proteins they share. If $\mathrm{OS}(p, d) \ge \omega$, the two sets p and d are considered to match each other. In this paper, the threshold $\omega$ is assigned in the same manner as in previous studies [12, 37]. Precision, recall and F-measure are then defined as follows:

$$\mathrm{precision} = \frac{|\{d \in D : \exists\, p \in P,\ \mathrm{OS}(p, d) \ge \omega\}|}{|D|}, \quad \mathrm{recall} = \frac{|\{p \in P : \exists\, d \in D,\ \mathrm{OS}(p, d) \ge \omega\}|}{|P|}, \quad F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where P is the set of known complexes, D is the set of detected modules and the F-measure is the harmonic mean of recall and precision.
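The two measures used for the synthetic networks can be sketched as below. Two assumptions are worth flagging: the label mapping is found by exhaustive search over permutations, which stands in for the Kuhn-Munkres algorithm and is only practical for a handful of modules, and NMI is computed in the confusion-matrix form normalised by the sum of the two partition entropies. The function names are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log


def accuracy(truth, detected):
    """Fraction of nodes labelled correctly under the best label mapping."""
    d_labels = sorted(set(detected))
    t_labels = sorted(set(truth))
    # Pad with None so every detected label can be mapped to something.
    t_labels += [None] * max(0, len(d_labels) - len(t_labels))
    best = 0
    for perm in permutations(t_labels, len(d_labels)):
        mapping = dict(zip(d_labels, perm))
        hits = sum(1 for g, d in zip(truth, detected) if mapping[d] == g)
        best = max(best, hits)
    return best / len(truth)


def nmi(truth, detected):
    """Normalised mutual information between two hard partitions."""
    n = len(truth)
    n_g = Counter(truth)                 # sizes of ground-truth modules
    n_d = Counter(detected)              # sizes of detected modules
    n_gd = Counter(zip(truth, detected)) # overlaps between module pairs
    num = -2.0 * sum(c * log(c * n / (n_g[g] * n_d[d]))
                     for (g, d), c in n_gd.items())
    den = (sum(c * log(c / n) for c in n_g.values())
           + sum(c * log(c / n) for c in n_d.values()))
    # Both partitions consist of a single module: treat them as identical.
    return 1.0 if den == 0 else num / den
```

A relabelled but otherwise identical partition scores 1 on both measures, while a partition that is statistically independent of the ground truth scores an NMI of 0.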
4.3 Performance on synthetic networks
To evaluate the module identification capability of the proposed algorithm PCNMTF, seven well-known state-of-the-art NMF-based community detection methods and two non-NMF-based methods are compared with it. The seven NMF-based methods are NMF [48], pairwise constrained NMF (PCNMF) [33], symmetric NMF (SNMF) [28], pairwise constrained SNMF (PCSNMF) [33], NMTF [49], NMTF with a Jaccard similarity matrix (NMTFJAC) and NMTF with the adjacency matrix (NMTFADJ) [21-23]. The graph regularisation term used in NMTFJAC is based on the Jaccard similarity between two proteins. Proteins linked to each other in a PPI network are thought to have similar functions, so in NMTFADJ the adjacency matrix is used as the similarity matrix. The two non-NMF-based methods are K-rank-D [50] and MCODE [18].

The NMI and accuracy metrics are used to evaluate the performance of the module detection methods, and the parameters of PCNMF and PCSNMF are chosen to obtain their best results. The parameter $\lambda$, which balances the trade-off between topological information and prior information, is set to 10 for PCNMF and 100 for PCSNMF. Note that when $\lambda = 0$, PCNMF is equivalent to NMF and PCSNMF is equivalent to SNMF. The sensitivity analysis of the two parameters $\lambda$ and $\beta$ is conducted in Section 4.5, and the smoothing parameter $\beta$ and the trade-off parameter $\lambda$ of the proposed method PCNMTF are set accordingly. The value chosen for $\lambda$ indicates that must-link constraints play an important role in detecting modules from complicated networks, which is consistent with previous studies [33, 51, 52].

The must-link constraints are extracted from the benchmark modules in the same way as in Yang's work [33]. Suppose that there are N nodes in one module; the number of possible node pairs with a must-link constraint is then $N(N-1)/2$. The percentages of node pairs with must-link constraints reported in this section are relative to $N(N-1)/2$. Tables 2 and 3 show the accuracy of the modules detected by the different methods for various percentages of prior information, and Figs. 2a and b display the NMI of the different algorithms for various percentages of prior information. Both the accuracy and the NMI of all semi-supervised algorithms improve consistently as the amount of must-link information increases. The proposed model PCNMTF performs best and improves most rapidly. The NMI and accuracy of PCNMTF approach 1 once the percentage of must-link information exceeds 10% on GN8 networks and 15% on LFR networks, which means that PCNMTF can identify modules effectively and efficiently in networks with an unclear modular structure. The marked improvement of PCNMTF is due to making full use of must-link information and module correlation simultaneously.
Table 2
Accuracy of compared methods with different percentages of must-link constraints on GN8

| Method | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 |
|---|---|---|---|---|---|---|
| MCODE | 0.794 ± 0.01 | 0.794 ± 0.01 | 0.794 ± 0.01 | 0.794 ± 0.01 | 0.794 ± 0.01 | 0.794 ± 0.01 |
| K-rank-D | 0.739 ± 0.05 | 0.739 ± 0.05 | 0.739 ± 0.05 | 0.739 ± 0.05 | 0.739 ± 0.05 | 0.739 ± 0.05 |
| NMF | 0.859 ± 0.03 | 0.859 ± 0.03 | 0.859 ± 0.03 | 0.859 ± 0.03 | 0.859 ± 0.03 | 0.859 ± 0.03 |
| PCNMF | 0.867 ± 0.03 | 0.961 ± 0.02 | 1.000 ± 0.01 | 1.000 ± 0.00 | 1.000 ± 0.00 | 1.000 ± 0.01 |
| SNMF | 0.867 ± 0.01 | 0.867 ± 0.01 | 0.867 ± 0.01 | 0.867 ± 0.01 | 0.867 ± 0.01 | 0.867 ± 0.01 |
| PCSNMF | 0.937 ± 0.01 | 0.984 ± 0.00 | 0.993 ± 0.00 | 1.000 ± 0.01 | 1.000 ± 0.00 | 1.000 ± 0.00 |
| NMTF | 0.862 ± 0.02 | 0.862 ± 0.02 | 0.862 ± 0.02 | 0.862 ± 0.02 | 0.862 ± 0.02 | 0.862 ± 0.02 |
| NMTFADJ | 0.859 ± 0.01 | 0.859 ± 0.01 | 0.859 ± 0.01 | 0.859 ± 0.01 | 0.859 ± 0.01 | 0.859 ± 0.01 |
| NMTFJAC | 0.846 ± 0.03 | 0.846 ± 0.03 | 0.846 ± 0.03 | 0.846 ± 0.03 | 0.846 ± 0.03 | 0.846 ± 0.03 |
| PCNMTF | 0.997 ± 0.01 | 1.000 ± 0.01 | 1.000 ± 0.00 | 1.000 ± 0.02 | 1.000 ± 0.00 | 1.000 ± 0.00 |
Table 3
Accuracy of compared methods with different percentages of must-link constraints on LFR

| Method | 0.05 | 0.1 | 0.15 | 0.2 | 0.25 | 0.3 |
|---|---|---|---|---|---|---|
| MCODE | 0.46 ± 0.03 | 0.46 ± 0.03 | 0.46 ± 0.03 | 0.46 ± 0.03 | 0.46 ± 0.03 | 0.46 ± 0.03 |
| K-rank-D | 0.53 ± 0.06 | 0.53 ± 0.06 | 0.53 ± 0.06 | 0.53 ± 0.06 | 0.53 ± 0.06 | 0.53 ± 0.06 |
| NMF | 0.393 ± 0.04 | 0.393 ± 0.04 | 0.393 ± 0.04 | 0.393 ± 0.04 | 0.393 ± 0.04 | 0.393 ± 0.04 |
| PCNMF | 0.431 ± 0.02 | 0.359 ± 0.02 | 0.776 ± 0.01 | 0.679 ± 0.04 | 0.671 ± 0.02 | 0.635 ± 0.06 |
| SNMF | 0.567 ± 0.02 | 0.567 ± 0.02 | 0.567 ± 0.02 | 0.567 ± 0.02 | 0.567 ± 0.02 | 0.567 ± 0.02 |
| PCSNMF | 0.591 ± 0.02 | 0.856 ± 0.01 | 0.977 ± 0.05 | 0.994 ± 0.02 | 1.000 ± 0.01 | 1.000 ± 0.03 |
| NMTF | 0.571 ± 0.04 | 0.571 ± 0.04 | 0.571 ± 0.04 | 0.571 ± 0.04 | 0.571 ± 0.04 | 0.571 ± 0.04 |
| NMTFADJ | 0.542 ± 0.03 | 0.542 ± 0.03 | 0.542 ± 0.03 | 0.542 ± 0.03 | 0.542 ± 0.03 | 0.542 ± 0.03 |
| NMTFJAC | 0.475 ± 0.05 | 0.475 ± 0.05 | 0.475 ± 0.05 | 0.475 ± 0.05 | 0.475 ± 0.05 | 0.475 ± 0.05 |
| PCNMTF | 0.639 ± 0.01 | 0.922 ± 0.02 | 0.999 ± 0.01 | 1.000 ± 0.02 | 1.000 ± 0.03 | 1.000 ± 0.01 |
Fig. 2
NMI of different methods with different percentages of must-link constraints derived from the ground truth: (a) GN8 network, (b) LFR network
4.4 Performance on human PPI networks
Since must-link constraints improve the detection of modules in networks, the proposed model PCNMTF was applied to detect protein functional modules in two human PPI networks, DIP and HPRD, with the same parameter settings as discussed in Section 4.3.
4.4.1 Must‐link constraints
The must-link prior information is extracted from known protein complex databases such as CORUM and PCDq. Since protein complexes overlap, proteins included in more than one complex are not considered when extracting must-link constraints. For each protein complex, only the proteins contained in that single complex are used to extract must-link constraints; the number of such proteins is denoted as $n_c$, and $n_c(n_c - 1)/2$ protein pairs with a must-link constraint are extracted from the complex. Note that a must-link constraint only states that the two corresponding proteins should belong to the same module; it does not specify which module that is. In this work, the must-link constraints are extracted from CORUM. Thus, 803 must-link constraints involving 470 proteins and 2876 must-link constraints involving 997 proteins are extracted for DIP and HPRD, respectively.
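The extraction rule described above can be sketched as follows. The function name `extract_must_links` and the toy complexes are hypothetical, and the sketch does not restrict the proteins to those present in a particular PPI network, which a real pipeline would presumably also do.

```python
from collections import Counter
from itertools import combinations


def extract_must_links(complexes):
    """Return the set of must-link protein pairs.

    Proteins that occur in more than one complex are skipped; every pair
    of the remaining proteins of a complex becomes one must-link constraint.
    """
    membership = Counter(p for c in complexes for p in set(c))
    pairs = set()
    for c in complexes:
        unique = sorted(p for p in set(c) if membership[p] == 1)
        # A complex with m unique proteins contributes m * (m - 1) / 2 pairs.
        pairs.update(combinations(unique, 2))
    return pairs


# Protein "X" belongs to both toy complexes and is therefore ignored.
complexes = [{"A", "B", "C", "X"}, {"X", "D", "E"}]
must_links = extract_must_links(complexes)
```

Dropping the shared proteins avoids asserting that a multi-complex protein must sit in the same single module as all of its partners from different complexes.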
4.4.2 Detected modules
One challenge is how to determine the number of modules, k, because there is no prior knowledge about the number of modules in a real PPI network. NMF‐based methods usually assign module membership to each node according to the real values in the corresponding row of the module indicator matrix. If a column of this matrix contains no value larger than a given threshold, the module corresponding to that column is omitted. Therefore, we can fit the proposed model PCNMTF with a large value of k, as it is able to identify the number of modules adaptively. We set k to 500 and 1000 for DIP and HPRD, respectively. Detected modules with size smaller than 2 are filtered out. The results of all compared methods are reported in Table 4, where coverage is the number of detected proteins; #as, #ad and #ai indicate the average size, average degree and average interactions in the detected modules; #m is the number of detected modules; #mm is the number of modules matched with known complexes; #ai_ma is the average interactions of matched modules; and #ai_ml denotes the average interactions in matched modules but not in must‐link constraints. To evaluate the performance of PCNMTF on detecting functional modules, we first compared the detected modules with known complexes and then conducted enrichment analysis to evaluate the functional homogeneity of the detected modules.
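The thresholding rule described above can be illustrated with a short sketch. The function name `modules_from_indicator`, the plain list-of-lists representation of the module indicator matrix, and the threshold value of 0.5 are hypothetical choices for the example; the source does not state the threshold used.

```python
def modules_from_indicator(H, threshold=0.5, min_size=2):
    """Convert a nonnegative protein-by-module matrix into overlapping modules.

    H: list of rows, one row per protein and one column per candidate module.
    threshold: membership cut-off (hypothetical value, not from the paper).
    min_size: modules with fewer members are filtered out.
    """
    n_modules = len(H[0]) if H else 0
    modules = []
    for j in range(n_modules):
        # A protein joins module j when its entry in column j exceeds the threshold.
        members = [i for i, row in enumerate(H) if row[j] > threshold]
        # Columns with too few members are dropped, so k adapts to the data.
        if len(members) >= min_size:
            modules.append(members)
    return modules


toy_H = [
    [0.9, 0.0, 0.1],
    [0.8, 0.7, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.0, 0.9],
]
toy_modules = modules_from_indicator(toy_H)
```

Here protein 1 belongs to two modules, and the third column yields a single member and is discarded, which mirrors how a deliberately large k can shrink to the number of modules actually supported by the data.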
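Table 4 Information of modules detected by all compared methods on DIP and HPRD

| Network | Method | Coverage | #as | #ad | #ai | #m | CORUM #mm | CORUM #ai_ma | CORUM #ai_ml | PCDq #mm | PCDq #ai_ma | PCDq #ai_ml |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DIP | MCODE | 421 | 5.19 | 5.82 | 7.71 | 81 | 47 | 11.93 | 10.80 | 49 | 8.68 | 8.11 |
| DIP | K‐rank‐D | 1666 | 12.34 | 2.73 | 11.25 | 135 | 60 | 10.01 | 9.16 | 65 | 9.22 | 8.65 |
| DIP | NMF | 2679 | 11.26 | 3.69 | 3.73 | 255 | 102 | 2.80 | 1.61 | 109 | 2.21 | 1.81 |
| DIP | PCNMF | 2748 | 7.33 | 3.46 | 2.81 | 375 | 178 | 1.37 | 2.11 | 164 | 2.07 | 1.93 |
| DIP | SNMF | 1873 | 7.28 | 3.80 | 2.09 | 294 | 148 | 3.70 | 2.30 | 111 | 2.25 | 1.95 |
| DIP | PCSNMF | 2876 | 6.21 | 3.48 | 5.93 | 463 | 229 | 6.20 | 5.63 | 215 | 6.05 | 5.72 |
| DIP | NMTF | 2766 | 8.26 | 3.59 | 2.64 | 335 | 155 | 1.63 | 1.40 | 138 | 1.42 | 1.26 |
| DIP | NMTFADJ | 2701 | 9.72 | 3.63 | 3.26 | 278 | 138 | 2.50 | 1.65 | 95 | 1.61 | 1.36 |
| DIP | NMTFJAC | 2874 | 9.97 | 3.12 | 4.38 | 223 | 96 | 1.94 | 1.38 | 83 | 1.60 | 1.13 |
| DIP | PCNMTF | 2920 | 10.50 | 3.52 | 9.53 | 278 | 137 | 11.20 | 9.50 | 133 | 8.96 | 8.44 |
| HPRD | MCODE | 1161 | 11.38 | 16.58 | 18.77 | 102 | 37 | 27.25 | 25.68 | 48 | 12.49 | 11.45 |
| HPRD | K‐rank‐D | 5316 | 33.22 | 2.90 | 55.12 | 160 | 33 | 9.00 | 7.91 | 54 | 3.85 | 3.37 |
| HPRD | NMF | 9178 | 11.125 | 8.13 | 2.94 | 825 | 241 | 3.97 | 3.80 | 300 | 1.52 | 1.44 |
| HPRD | PCNMF | 9055 | 12.63 | 8.77 | 3.88 | 888 | 277 | 6.32 | 5.36 | 259 | 4.04 | 3.57 |
| HPRD | SNMF | 9392 | 9.77 | 7.93 | 10.03 | 961 | 315 | 12.37 | 11.58 | 464 | 10.14 | 9.74 |
| HPRD | PCSNMF | 9159 | 22.85 | 9.63 | 6.19 | 954 | 257 | 11.00 | 9.71 | 216 | 10.32 | 9.45 |
| HPRD | NMTF | 9266 | 10.54 | 7.00 | 7.34 | 879 | 284 | 4.58 | 3.89 | 286 | 2.64 | 2.24 |
| HPRD | NMTFADJ | 9243 | 10.66 | 9.09 | 1.55 | 867 | 247 | 10.85 | 8.98 | 234 | 1.60 | 1.27 |
| HPRD | NMTFJAC | 9239 | 10.64 | 7.32 | 5.73 | 868 | 214 | 13.30 | 10.11 | 326 | 2.78 | 2.12 |
| HPRD | PCNMTF | 9337 | 9.52 | 8.78 | 10.27 | 991 | 391 | 12.11 | 11.14 | 525 | 11.18 | 10.77 |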
The convergence of our proposed model PCNMTF was also investigated. The values of objective function (6) with respect to the number of iterations are plotted in Fig. 3, from which we can see that PCNMTF reaches a local optimum after some iterations.
Fig. 3 Values of (6) with respect to various iteration numbers on DIP network
4.4.3 Protein complexes
Although the prior information is extracted from the complex database CORUM, the number of proteins contained in the prior information is smaller than the number of proteins in CORUM. Since only part of the proteins in CORUM are used as prior information, we still need to compare the detected modules with the complexes in CORUM. Another well‐known human‐related protein complex database, PCDq, is also used as a gold standard. The precision, recall and F‐measure of all compared algorithms on both PPI networks are shown in Figs. 4a and b, which use CORUM as ground truth, and Figs. 4c and d, which use PCDq as ground truth. The proposed algorithm PCNMTF outperforms the other compared methods on all three metrics, except for MCODE and K‐rank‐D on precision. That is because they detect fewer modules and proteins than PCNMTF (Table 4). Incorporating prior must‐link information into the models can significantly improve their ability to detect functional modules. The results on real human‐related PPI networks indicate that the proposed PCNMTF model offers a more effective way to discover protein functional modules in PPI networks.
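As an illustration of how precision, recall and F‐measure can be computed for a set of detected modules against a gold standard, consider the sketch below. The matching criterion is an assumption: this section does not restate it, so the example uses the commonly used neighbourhood‐affinity score |A ∩ B|² / (|A| · |B|) with a cut‐off of 0.2. The function names and the cut‐off are hypothetical and may differ from what the authors used.

```python
def overlap_score(a, b):
    """Neighbourhood-affinity score |a & b|^2 / (|a| * |b|) between two sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def precision_recall_f(modules, complexes, min_score=0.2):
    """Precision, recall and F-measure of detected modules vs known complexes.

    A module (complex) counts as matched when at least one complex (module)
    reaches min_score; min_score = 0.2 is an assumed, conventional cut-off.
    """
    modules = [set(m) for m in modules]
    complexes = [set(c) for c in complexes]
    matched_modules = sum(
        any(overlap_score(m, c) >= min_score for c in complexes) for m in modules
    )
    matched_complexes = sum(
        any(overlap_score(c, m) >= min_score for m in modules) for c in complexes
    )
    precision = matched_modules / len(modules) if modules else 0.0
    recall = matched_complexes / len(complexes) if complexes else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


toy_modules = [{"A", "B", "C"}, {"X", "Y"}]
toy_complexes = [{"A", "B", "C", "D"}, {"P", "Q"}, {"R", "S"}]
p, r, f = precision_recall_f(toy_modules, toy_complexes)
```

Under this definition, a method that reports only a few dense modules can reach high precision with low recall, which is consistent with the explanation given above for MCODE and K‐rank‐D.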
Fig. 4 Precision, recall and F‐measure of compared methods on DIP and HPRD. (a), (b) Take CORUM as ground truth; (c), (d) take PCDq as ground truth (‘ml’ means must‐link and ‘gs’ means gold standard database)
4.4.4 Enrichment analysis of detected modules
In order to explore the biological significance of the protein modules that are not covered by known protein complex databases, we conducted enrichment analysis for all detected modules in terms of gene ontology (GO) annotations, which comprise three categories: biological process (BP), cellular component (CC) and molecular function (MF). The extent of enrichment of each module is measured by a p‐value obtained from the hypergeometric test [53], and the functional homogeneity of a module is evaluated by this p‐value. For a specific GO function, a smaller p‐value indicates that the module is more significantly enriched for this function. The proportion of modules with p‐values below a given threshold was then calculated for all computational methods. The threshold was varied up to 0.01; for a specific threshold, a higher percentage of modules below it means that the algorithm is more effective at detecting functional modules from PPI networks. Fig. 5 presents the proportion of modules in different p‐value intervals on DIP and HPRD in terms of BP, CC and MF, and we can see that PCNMTF performs better than the compared methods on both networks. Thus, the proposed model PCNMTF can be used to detect more homogeneous functional modules from PPI networks. To show which modules were detected in the human‐related PPI networks, we list the top 5 significant modules in terms of BP for DIP and HPRD in Tables 5 and 6, respectively.
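The hypergeometric enrichment p‐value and the proportion of enriched modules can be sketched as below. This is a generic upper‐tail hypergeometric test written with the standard library; the function and argument names are chosen for the example, and any multiple‐testing correction the authors may have applied is not modelled.

```python
from math import comb


def hypergeom_pvalue(population, annotated, module_size, hits):
    """P(X >= hits) when drawing module_size proteins without replacement
    from a population in which `annotated` proteins carry the GO term."""
    if not 0 <= annotated <= population or not 0 <= module_size <= population:
        raise ValueError("annotated and module_size must lie in [0, population]")
    upper = min(annotated, module_size)
    total = comb(population, module_size)
    # Sum the probability mass for every count from `hits` up to the maximum.
    tail = sum(
        comb(annotated, i) * comb(population - annotated, module_size - i)
        for i in range(max(hits, 0), upper + 1)
    )
    return tail / total


def proportion_enriched(pvalues, threshold):
    """Fraction of modules whose best p-value is below the threshold."""
    if not pvalues:
        return 0.0
    return sum(p < threshold for p in pvalues) / len(pvalues)


# Toy case: 10 proteins, 5 annotated, a module of 5 containing all 5 annotated.
toy_p = hypergeom_pvalue(10, 5, 5, 5)
toy_fraction = proportion_enriched([1e-6, 0.2, 0.004, 0.5], 0.01)
```

In the toy case only one of the 252 possible modules of size 5 contains all five annotated proteins, so the p‐value is 1/252.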
Fig. 5 Enrichment analysis of all methods. The first three panels are the proportions of enriched modules from DIP with different p‐values in terms of BP, CC and MF, respectively; the last three panels are the proportions of enriched modules from HPRD with different p‐values in terms of BP, CC and MF, respectively
Table 5 Top 5 modules enriched on BP terms from DIP

| Module ID | Size | Members | p‐Value | GO ID | GO term |
|---|---|---|---|---|---|
| 256 | 12 | ANAPC2 FZR1 MAD2L1 RAE1 CDC23 FBXO5 BUB1B ANAPC10 CDC20 ANAPC7 PTTG1 CDC16 | 4.29E−17 | GO:0031145 | transcription initiation from RNA polymerase II promoter |
| 102 | 14 | EIF3C EIF3D EIF3A EIF3B EIF3F EIF3G EIF3H EIF1AX EIF3E EIF3M EIF3K EIF3L EIF1 EIF3I EIF3J | 3.08E−15 | GO:0006413 | translational initiation |
| 458 | 13 | TIFA TAB2 PSMB5 UBE2N PSMD13 PSMD12 PSMC3 PSMC2 PSMD1 IL1RAP PSMD3 PSMD6 PSMD7 | 4.75E−15 | GO:0006521 | regulation of cellular amino acid metabolic process |
| 335 | 11 | FGF6 FGF5 FGF8 FGF7 FGF9 FGF10 MMP14 FGF1 FGF2 FGF3 FGF4 | 1.41E−14 | GO:0051781 | positive regulation of cell division |
| 218 | 11 | NFE4 PPM1G WRAP53 SNUPN SNRPD3 SNRPD2 DDX20 SNRPF COIL SNRPE SMN1 | 7.54E−13 | GO:0034660 | ncRNA metabolic process |
Table 6 Top 5 modules enriched on BP terms from HPRD

| Module ID | Size | Members | p‐Value | GO ID | GO term |
|---|---|---|---|---|---|
| 8 | 13 | ABCF1 PDK1 PDK2 WFS1 PDK3 PDK4 DLAT PDHB ACAP3 PDHA2 C4orf27 PDHA1 PDHX | 7.05E−21 | GO:0010510 | regulation of acetyl‐CoA biosynthetic process from pyruvate |
| 97 | 19 | GJA8 CLDN16 CLDN8 CLDN7 CLDN3 CLDN6 CLDN5 GJB3 GJA3 ARVCF KIRREL CGN TJP3 JAM2 JAM3 TJP2 CLDN4 GJC1 CLDN2 | 2.99E−19 | GO:0016338 | calcium‐independent cell–cell adhesion |
| 460 | 16 | CCL2 CXCL9 CCL19 CCL8 CCL28 CCL7 CCL27 CXCL10 CCL5 CCL24 CCL25 CCL13 CXCL11 CCL11 XCL2 CCL21 | 9.65E−19 | GO:0006955 | immune response |
| 45 | 14 | PRKAG3 PFKFB2 PRKAG1 PRKAG2 PRKAB2 PRKAB1 NHLRC1 EEF2K PRKAA1 CAB39 AGL GCKR NDUFA7 ACACB | 1.62E−18 | GO:0046320 | regulation of fatty acid oxidation |
| 395 | 18 | FLRT3 FGF19 FGF6 CCDC17 FGF8 FGF7 FGF17 SMG7 IL17RD HBZ C6orf47 FGF1 FGF3 FGF18 FGF5 FGF10 FGF23 FGF4 | 1.66E−18 | GO:0008286 | insulin receptor signalling pathway |
4.5 Parameter analysis
Two parameters can affect the performance of our proposed model PCNMTF. To make clear how these two parameters work, we apply PCNMTF to the GN8 networks while changing both values at the same time, and we illustrate the distribution of NMI over the different parameter values. The influence of the two parameters is presented in Fig. 6. We vary the first parameter over its range and the second from 0.1 to 1000. With the same setting, we evaluate the influence in terms of F‐measure for DIP; this distribution is also shown in Fig. 6. We observe that the proposed PCNMTF performs better when the first parameter is in the vicinity of 0.05 and the second is larger than 10. The two parameters have the same influence on LFR and HPRD. The presented results are averaged over 50 repeated experiments.
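A parameter sweep of this kind can be sketched generically as follows. Everything here is a placeholder: `score_fn` stands for one run of a model followed by an evaluation such as NMI or F‐measure, the names `first_values` and `second_values` are hypothetical because the parameter symbols are not given in this section, and the toy scoring function exists only to make the example runnable.

```python
from itertools import product
from statistics import mean


def sweep_parameters(score_fn, first_values, second_values, repeats=50):
    """Average score_fn(a, b, seed) over `repeats` runs for every (a, b) pair.

    score_fn is a user-supplied callable (hypothetical here) that fits a model
    with the two parameter values and a random seed, then returns a score.
    """
    results = {}
    for a, b in product(first_values, second_values):
        results[(a, b)] = mean(score_fn(a, b, seed) for seed in range(repeats))
    return results


def toy_score(a, b, seed):
    """Stand-in score that peaks at a = 0.05 and grows with b; ignores seed."""
    return -abs(a - 0.05) + min(b, 10) / 10


toy_results = sweep_parameters(
    toy_score, [0.01, 0.05, 0.1], [0.1, 1, 10, 100], repeats=3
)
toy_best = max(toy_results, key=toy_results.get)
```

Averaging over repeated runs matters for factorisation models with random initialisation, since a single run per grid point can make the score surface look noisier than it is.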
Fig. 6 Influence of the two model parameters (panels: on GN8 network, on DIP network)
5 Conclusion
In this manuscript, we propose a novel semi‐supervised model, PCNMTF, to detect overlapping protein functional modules from human PPI networks. The proposed model makes better use of both the topological properties of PPI networks and human‐curated protein complexes. The experiments are conducted on both synthetic networks and real‐world human‐related PPI networks, DIP and HPRD, and PCNMTF shows superior performance in finding functional modules even though we incorporate only a very limited number of must‐links, which are extracted from CORUM. In future work, we will consider how to incorporate other biological information about proteins, such as gene expression and GO functional annotations, to obtain high‐quality functional modules from human PPI networks.