Protein complex prediction via dense subgraphs and false positive analysis.

Cecilia Hernandez1,2, Carlos Mella1, Gonzalo Navarro2, Alvaro Olivera-Nappa3, Jaime Araya1.   

Abstract

Many proteins work together with others in groups called complexes in order to achieve a specific function. Discovering protein complexes is important for understanding biological processes and predicting protein functions in living organisms. Large-scale, high-throughput techniques have made it possible to compile protein-protein interaction networks (PPI networks), which have been used in several computational approaches for detecting protein complexes. Those predictions might guide future biological experimental research. Some approaches are topology-based, where highly connected proteins are predicted to be complexes; some propose different clustering algorithms using partitioning or overlaps among clusters for networks modeled with unweighted or weighted graphs; and others use density of clusters and information based on protein functionality. However, some schemes still require much processing time, or the quality of their results can be improved. Furthermore, most of the results obtained with computational tools are not accompanied by an analysis of false positives. We propose an effective and efficient mining algorithm for discovering highly connected subgraphs, which is our base for defining protein complexes. Our representation is based on transforming the PPI network into a directed acyclic graph that reduces the number of represented edges and the search space for discovering subgraphs. Our approach considers weighted and unweighted PPI networks. We compare our best alternative using PPI networks from Saccharomyces cerevisiae (yeast) and Homo sapiens (human) with state-of-the-art approaches in terms of clustering, biological metrics and execution times, as well as three gold standards for yeast and two for human. Furthermore, we analyze false positive predicted complexes by searching the PDBe (Protein Data Bank in Europe) database in order to identify matching protein complexes that have been purified and structurally characterized. 
Our analysis shows that more than 50 yeast protein complexes and more than 300 human protein complexes found to be false positives according to our prediction method, i.e., not described in the gold standard complex databases, in fact contain protein complexes that have been characterized structurally and documented in PDBe. We also found that some of these protein complexes have recently been classified as part of a Periodic Table of Protein Complexes. The latest version of our software is publicly available at http://doi.org/10.6084/m9.figshare.5297314.v1.

Year:  2017        PMID: 28937982      PMCID: PMC5609739          DOI: 10.1371/journal.pone.0183460

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Understanding biological processes at the cellular and system levels is an important task in all living organisms. Proteins are crucial components in many biological processes, such as metabolic and immune processes, transport, signaling, and enzymatic catalysis. Most proteins bind to other proteins in groups of interacting molecules, forming protein complexes to carry out biological functions. Berggård et al. [1] showed that more than 80% of proteins work in complexes. Moreover, many proteins are multifunctional, in the sense that they are part of different complexes according to the specific function required in the system. The discovery of protein complexes is of paramount relevance since it helps discover the structure-function relationships of protein-protein interaction networks (PPI networks), improving the understanding of the protein roles in different functions. Furthermore, understanding the roles of proteins in diverse complexes is important for many diseases, since biological research has shown that the deletion of some highly connected proteins in a network can have lethal effects on organisms [2]. Technological advances in biological experimental techniques have made possible the compilation of large-scale PPI networks for many organisms. Given the large volume of PPI networks, many mining algorithms have been proposed in recent years for discovering protein complexes. Research on PPI networks has shown that these networks have features similar to those of complex networks based on topological structures, such as small-world [3] and scale-free [4] properties. These networks are also formed by very cohesive structures [5]. These properties have been the inspiration for different computational approaches that identify protein complexes in PPI networks based on topological features. Most of these strategies model PPI networks as undirected graphs, where vertices represent proteins and edges are the interactions between them. 
Some strategies are based on density-based clustering [6, 7], community detection algorithms [8], dense subgraphs [9-11], and flow simulation-based clustering [12]. Since there are multifunctional proteins, some strategies also consider overlap among modules. Some strategies that are based on dense subgraphs use overlapping cliques, such as CFinder [10], distance metrics [9], and greedy algorithms for finding overlapping cohesive clusters [11] (ClusterONE). However, other methods do not consider overlapping structures, such as MCL [12] and the winner of the Disease Module Identification DREAM Challenge for subchallenge 1 (closed in November 2016), which we call DSDCluster. DSDCluster is a method that first applies the DSD algorithm [13], which consists of computing a distance metric (Diffusion State Distance) for the connected genes in the network, and then applies spectral clustering. Other known algorithms for protein complex prediction are MCODE [14], RNSC [15], SPICI [16], DCAFP [17] and COREPEEL [18]. Complete surveys of computational approaches are available [19, 20]. An important characteristic of PPI networks is that they are noisy and incomplete, mainly due to the imprecision of biological experimental techniques. To deal with this feature, some researchers assign a weight to each edge representing the probability of the interaction being real [21-23]. Weights are inferred by analyzing primary affinity purification data of the biological experiments and defining scoring techniques for the protein interactions. These studies have motivated research on complex prediction tools that consider weights in the topological properties, with or without overlaps among complexes. Most of these computational strategies model PPI networks as undirected weighted graphs. Other approaches also include functional annotations of proteins to improve the quality of predicted complexes. 
Some of these techniques include functional annotation analysis as a pre-processing or post-processing step for predicted complexes [24, 25]; others include functional information in the complex prediction algorithms [7, 26]. Pre-processing strategies might also define weights in PPI networks based on functional similarity, and then use clustering algorithms on weighted graphs. In these approaches, both the definition of the similarity measure and the clustering algorithm are important, and the latter should support overlap on weighted graphs. Post-processing strategies apply functional knowledge on predicted complexes, so their outcome also depends on the quality of the predicted complexes. Applying functional annotations during the complex discovery is an interesting approach, but it also depends on the quality of the functional similarity definition and on the time complexity of the algorithm. In order to validate predicted complexes, all computational strategies compare their results with gold standards used as references. Currently, CYC2008 [27] is the gold standard that reflects the current state of knowledge for yeast. This catalog contains 408 manually curated heteromeric protein complexes reliably supported by small-scale experiments reported in the literature. In fact, CYC2008 was proposed as an update of the MIPS (Munich Information Center of Protein Sequences) database [28], which was used as a reference until 2008. Another up-to-date reference for yeast is available at the SGD (Saccharomyces Genome Database) [29]. The prediction algorithms are important tools for updating the gold standards so that they reflect the latest biological knowledge. For example, one of the strategies used for building CYC2008 consisted of using the MCL (Markov Clustering) [12] algorithm for predicting protein complexes. This provided some complexes that were not in MIPS. Even though MCL is a very reliable algorithm, it does not support overlaps [19]. 
Using better prediction algorithms can therefore improve the current state of knowledge. Still, even though there are several prediction tools, there is no single method with dominating performance in terms of prediction quality and execution time for both small and large PPI networks.

Our contribution

We propose an effective and efficient strategy for predicting protein complexes, using dense subgraphs built from complete bipartite graph patterns. Even though finding densely connected subgraphs is not a new idea and may not be the optimal property to look for in order to identify protein complexes (indeed, that optimal property is unknown), this approach makes sense from different points of view. First, it is biologically intuitive and evolutionarily logical to expect a low number of proteins to participate in many interactions, especially considering that such proteins should act as good control points for multiple related biological functions. This case is common in currently known biological networks and complexes and can explain why PPI networks have characteristics of “small-world” graphs. Second, analyzing the structural assembly of known complexes of more than two different proteins [30, 31], the majority of them involve highly connected protein nodes and cliques (see, for instance, all examples in Figure 3 of Marsh et al., 2015 [31], or Figure 6 in Ahnert et al., 2015 [30]), and there seem to be only a few ways in which protein complexes assemble. Third, protein complexes are thought to follow a few evolutionarily conserved ordered assembly pathways [32], which in practice limits how many individual protein-protein interactions can be experimentally demonstrated for a given complex and how they can be translated into real complexes. In this scenario, looking for densely connected subgraphs in a PPI network may not be optimal, but it is a property representative of the new discoveries in complex assembly and it is efficient to at least screen and identify putative complexes. This has been demonstrated previously by the effective use of this approach in other algorithms, such as ClusterONE [11] and COREPEEL [18]. 
From an algorithmic point of view, our dense subgraph definition allows us to discover cliques and complete bipartite graphs that overlap. Since finding all maximal cliques in a graph is NP-complete [33], we propose a transformation of the input PPI network into an acyclic graph on which we design fast mining heuristics for finding dense subgraphs. Our approach is somewhat related to ClusterONE [11], in the sense that ClusterONE also uses a greedy heuristic that builds groups of vertices with high cohesiveness starting at seed vertices. In our approach, we first reduce the complexity of dense subgraph mining with the construction of the acyclic graph from an input graph representing a PPI network. Then, we apply two different objective functions; the first enables the fast traversal of the acyclic graph and the second is used for detecting maximal dense subgraphs. COREPEEL, on the other hand, is related to our algorithm in the sense that it is also based on detecting dense subgraphs, but its approach uses core decomposition for finding quasi cliques in the graph (core) and then removes nodes with minimum degree (peel). Other approaches that also predict overlapping protein complexes are GMFTP [26] and DCAFP [17]. GMFTP builds an augmented network from a PPI network by adding functional information so that protein complexes can be discovered based on cliques identified from the augmented network. DCAFP also uses topological and functional information related to PPI networks. We evaluate our algorithms using clustering and biological metrics on current yeast PPI networks, and compare our results with state-of-the-art strategies. We analyze the predicted complexes in terms of matching with three references for Saccharomyces cerevisiae (CYC2008, SGD, and MIPS) and two references for Homo sapiens (PCDq [34] and CORUM [35]). We show that our approach improves upon the state of the art in quality and that it is fast in practice. 
DSDCluster achieves average performance (about the sixth best) in terms of clustering and biological metrics in all PPI networks, except on Biogrid-yeast where it is able to predict the greatest number of protein complexes that are in the CYC2008 gold standard (five more than the other methods). ClusterONE and COREPEEL provide good results and are also fast; however, our approach provides better results in terms of MMR, biological metrics and number of correct protein complexes based on gold standards in most of the PPI networks we analyzed in the manuscript. On the other hand, GMFTP and DCAFP provide good results but are several orders of magnitude slower than our approach. As noted above, updating the gold standards is an important application of complex prediction tools. However, most prediction approaches do not discuss the predicted complexes that are false positives with respect to the current complexes in the references. These predicted complexes are not necessarily incorrect results; they can actually be new complexes that have not yet been discovered, or can be part of biological evidence not captured in the construction of the current gold standards. In our work, we analyze the false-positive protein complexes predicted by our method (i.e., complexes not described in the gold standards), and report on our findings. Precisely, we searched for false-positive complexes that had been purified and structurally characterized in the PDBe (Protein Data Bank in Europe) database. Our results show that we achieve good performance in discovering protein complexes, while obtaining results of good quality. Compared with the state of the art, we are the first or the second best method considering the MMR measure [11] in both small and large PPI networks. 
Further, our automatic false positive analysis shows that many of our false positives in fact contain small curated protein complexes that are reported in PDBe and not found in gold standards: more than 50 on yeast and 300 on human proteins.

Materials and methods

In this section we present our graph definitions for modeling PPI networks, formulate the problem of finding dense subgraphs, and describe the algorithms for detecting dense subgraphs. Our approach enables us to find dense subgraphs that usually overlap among them. We then describe different alternatives for mapping dense subgraphs to protein complexes.

Graph models for PPI networks

Since the interactions among proteins in a PPI network are symmetric, these networks are usually modeled as undirected graphs, where proteins are vertices and interactions between proteins are edges. We represent a PPI network with adjacency lists, where each adjacency list contains the set of neighbors of a protein. In order to find complexes, we represent each undirected edge {u, v} as two directed edges (u, v) and (v, u). Therefore, u appears in the adjacency list of v and v appears in the adjacency list of u. The PPI network is then modeled as a directed graph G = (V, E, w), where V is the set of vertices (proteins), E ⊆ V × V is the set of edges (protein-protein interactions), and w: E → [0, 1] is a function that maps each edge to the probability that the interaction is real.
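As an illustration of this graph model, the following minimal sketch builds the adjacency lists and the weight function from a list of undirected interactions. The function name `build_adjacency`, the input format, and the decision to drop self-loops at this stage are assumptions made for the example, not the authors' implementation.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn undirected weighted interactions into sorted adjacency lists.

    edges: iterable of (u, v, weight) with weight in [0, 1] (hypothetical format).
    Each undirected edge {u, v} is stored as the directed edges (u, v) and (v, u).
    """
    adj = defaultdict(set)
    w = {}
    for u, v, weight in edges:
        if u == v:
            continue  # assumption: self-loops are ignored in this sketch
        adj[u].add(v)
        adj[v].add(u)
        w[(u, v)] = w[(v, u)] = weight
    return {u: sorted(nbrs) for u, nbrs in adj.items()}, w

adj, w = build_adjacency([(1, 2, 0.9), (2, 3, 0.4), (1, 3, 0.7)])
print(adj)        # adjacency list of every protein id
print(w[(3, 2)])  # the weight is the same in both directions
```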

Preliminaries

We first represent a protein-protein interaction network as a graph, where the protein names of the network are represented as vertices in the graph with numeric ids. Thus, each protein name must be mapped to a unique numeric id. Mapping protein names to numeric ids can be done using any node ordering algorithm, such as random, lexicographic, by degree, BFS traversal, and DFS traversal, among others. Our algorithm for finding dense subgraphs looks for cliques and complete bipartite subgraphs in the PPI network. The process of finding good dense subgraphs is run over an acyclic graph called DAPG, which is built from the input PPI network.

Definition 1 Directed Acyclic Prefix Graph (DAPG) Given a graph G = (V, E), a set V′ ⊆ V and a total order ϕ ⊆ V × V, we define a directed acyclic graph DAPG = (N, A), as follows: $N = \bigcup_{v' \in V'} adjlist(v')$ and $A = \{(u_1, u_2) \in N \times N : \exists v' \in V', u_1 \text{ and } u_2 \text{ are consecutive in } adjlist(v')\}$, where adjlist(v) = 〈u ∈ V, (v, u) ∈ E〉 is the adjacency list of node v in G = (V, E), listed in the total order ϕ.

Using a total order ϕ for the adjacency lists of G ensures that DAPG has no cycles. We consider two possible total orders ϕ: ID sorts the nodes by their ids, whereas FREQUENCY sorts them by their indegree, that is, the number of times they appear in all the adjacency lists of V′. Fig 1 shows the use of both orders.
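To make Definition 1 concrete, here is a minimal sketch that builds the arc set A and, for each DAPG node, the set of vertices whose adjacency lists contain it. It takes V′ = V. The function name `build_dapg`, the choice of decreasing indegree for FREQUENCY, and the tie-break by id are assumptions for illustration; the text only states that FREQUENCY sorts by indegree.

```python
from collections import Counter, defaultdict

def build_dapg(adj, order="ID"):
    """Return (arcs, vertex_set) of a DAPG built from adjacency lists (V' = V)."""
    # Indegree: how many adjacency lists each node appears in.
    freq = Counter(u for nbrs in adj.values() for u in nbrs)
    if order == "ID":
        key = lambda u: u
    else:
        # FREQUENCY: assumed decreasing indegree, ties broken by id so the
        # order stays total and the resulting graph stays acyclic.
        key = lambda u: (-freq[u], u)
    arcs = set()
    vertex_set = defaultdict(set)
    for v, nbrs in adj.items():
        ordered = sorted(nbrs, key=key)
        for u in ordered:
            vertex_set[u].add(v)              # v points to u in G
        arcs.update(zip(ordered, ordered[1:]))  # consecutive pairs become arcs
    return arcs, dict(vertex_set)

triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
arcs, vertex_set = build_dapg(triangle)
print(sorted(arcs))
print(vertex_set)
```

Because every arc goes from an earlier node to a later node in the same total order, no cycle can appear, which is the property the definition relies on.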
Fig 1

DAPG example.

(A) shows a PPI network as an undirected graph. (B) shows the PPI network as adjacency lists. (C) shows the DAPG using total order ϕ = ID and (D) shows the DAPG using total order ϕ = FREQUENCY.

We say that a node u′ is the parent of u in DAPG iff (u′, u) ∈ A, and call root a node with no parents. A path is a sequence of nodes $(u_1, \ldots, u_n)$ in DAPG such that $(u_i, u_{i+1}) \in A$ for i = 1, …, n − 1. In addition, we define attributes for any node u ∈ N in DAPG based on the input graph G = (V, E), as follows: label: a unique identifier given to a node v ∈ V in G. vertexSet(u) = {v ∈ V′, (v, u) ∈ E}. In words, the vertexSet of a node u ∈ N is the set of vertices v ∈ V′ pointing to u, that is, whose adjacency lists adjlist(v) contain u. Note that the FREQUENCY order sorts nodes u by |vertexSet(u)|. Let us now define the types of dense subgraphs we will detect.

Definition 2 Dense subgraph (DSG) A dense subgraph DSG(S, C) of G = (V, E) is any graph G′(S ∪ C, S × C), where S, C ⊆ V, and S × C ⊆ E, that is, it contains all the edges from a subset of nodes S to another subset C. Our implementation removes possible self-loops.

Note that Definition 2 includes cliques (S = C) and bicliques (S ∩ C = ∅, known as complete bipartite graphs), but also more general subgraphs where S ∩ C ≠ ∅. The following lemma defines the way we will find dense subgraphs.

Lemma 1 Given a DAPG D = (N, A), a path $P = (u_1, u_2, \ldots, u_k)$ in D, and a set R ⊆ P, a valid dense subgraph DSG = (S, C) is defined as $S = \bigcap_{u \in R} vertexSet(u)$ and C = R.

In order to find a promising path in DAPG starting from a given node u, we define an inverse traveler function, as follows.

Definition 3 Inverse traveler function An inverse traveler in DAPG is a partial function t: N → N, such that t(u) is a parent of u in DAPG. It gives no answer only when u is a root in N.

An inverse traveler function traverses a set of nodes in DAPG, moving from a node to one of its parents, up to a root. 
Therefore, given a node u, the nodes in the path P are determined by applying the function t repeatedly on u: u → t(u) → (t ∘ t)(u) → … → root. Once we have a path P we determine a set R ⊆ P, with u ∈ R, that maximizes a given objective function f defined as follows.

Definition 4 An objective function is a function $f: \mathcal{H} \to \mathbb{R}$, where $\mathcal{H}$ is the universe of dense subgraphs of the form H = (S, C) based on Definition 2.

Objective functions maximize some feature of dense subgraphs, aiming at detecting good ones. The functions used in this work are based on the number of edges in the dense subgraphs, or on a weighted density measure. They are listed in Table 1.
Table 1

Inverse traveler and objective functions.

Function | Definition
Inverse traveler functions
Deepest | Parent p of u with maximum maxDepth, that is, maxDepth(p) = maxDepth(u) − 1
Sharing | Parent p of u with maximum $|u.vertexSet \cap p.vertexSet|$
Objective functions
UNONE | Intersection size: $f_{obj}(dsg) = |S \cap C|$
WDEGREE | Weighted degree density: $f_{obj}(dsg) = \frac{\sum_{a \in E(S \times C)} w(a)}{|S \cup C|}$, where w(a) is the weight of edge a
WEDGE | Weighted edge density: $f_{obj}(dsg) = \frac{2 \times \sum_{a \in E(S \times C)} w(a)}{|S \cup C| \times (|S \cup C| - 1)}$
FWEDGREE | Full weighted degree density: WDEGREE of the induced subgraph of $S \cup C$
FWEDGE | Full weighted edge density: WEDGE of the induced subgraph of $S \cup C$
An important advantage of our approach is that it can easily be extended with new traveler and objective functions. New traveler functions might improve the mining process for discovering dense subgraphs, and new objective functions might include biological knowledge to discover subgraphs with biological significance. Our problem can then be formulated as follows.

Problem: Detecting Maximal Dense Subgraphs For a given graph G = (V, E, w), represented by a DAPG (N, A), a weight function w: E → [0, 1], a traveler function t, and a given objective function f, output a set of maximal dense subgraphs (S, C) of G.
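The weighted objective functions of Table 1 can be sketched as plain functions over a candidate DSG (S, C). This is only an illustration under stated assumptions: the node count is taken to be |S ∪ C|, each undirected edge is counted once in the weight sum, and the names `weight_sum`, `wdegree` and `wedge` are hypothetical.

```python
def weight_sum(S, C, w):
    """Total weight of the S x C edges, counting each undirected edge once."""
    seen = set()
    total = 0.0
    for s in S:
        for c in C:
            e = frozenset((s, c))
            if s != c and e not in seen and (s, c) in w:
                seen.add(e)
                total += w[(s, c)]
    return total

def wdegree(S, C, w):
    """Weighted degree density: edge weight per node (assumed |S u C| nodes)."""
    n = len(S | C)
    return weight_sum(S, C, w) / n if n else 0.0

def wedge(S, C, w):
    """Weighted edge density: edge weight relative to a complete graph."""
    n = len(S | C)
    return 2 * weight_sum(S, C, w) / (n * (n - 1)) if n > 1 else 0.0

# A triangle with all weights equal to 1, stored in both directions.
w = {(a, b): 1.0 for a in (1, 2, 3) for b in (1, 2, 3) if a != b}
clique = {1, 2, 3}
print(wdegree(clique, clique, w), wedge(clique, clique, w))
```

Under these assumptions a clique whose interactions all have weight 1 reaches the maximum weighted edge density of 1, which is the behaviour one would expect from a density measure.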

Algorithms

Our algorithm first represents a PPI network as a graph G where each protein in the network is a vertex with a numeric id. Mapping protein names to numeric ids can be performed using any node ordering algorithm. In this work, we use six different mappings. First maps protein names to numeric ids in the order in which proteins are read from the PPI network. Lexicographic sorts the protein names and then assigns the numeric ids in that order. Degree sorts the proteins by decreasing degree in the network and then assigns the numeric ids in that order. Random maps protein names to numeric ids randomly. Finally, BFS and DFS map protein names based on the breadth-first or depth-first search network traversal, respectively. The algorithm we propose for discovering dense subgraphs proceeds in two phases. The first phase builds an acyclic graph DAPG from G, using a total ordering function in the adjacency lists. As mentioned, we propose two total ordering functions: ID and FREQUENCY. The second phase consists of discovering dense subgraphs based on optimizing two functions: one guides the traversal on DAPG and the other specifies which nodes to choose. Lemma 1 enables the detection of dense subgraphs from DAPG; however, even for a given path P, finding all the possible sets R in the path requires time exponential in the number of nodes in the path. Finding the best paths P in DAPG is also exponential-time. Instead, we design an efficient mining heuristic for discovering dense subgraphs in DAPG. The main mining heuristic is based on finding at most one dense subgraph starting at each node in DAPG. This approach enables us to find dense subgraphs that might overlap. The heuristic is based on finding a promising path $P = (u_1, u_2, \ldots, u_k)$ so that $u_1$ is a root in DAPG. We find a promising path in DAPG starting from a given node u using an inverse traveler function given in Definition 3. 
The core of our mining technique starts at each node v in DAPG and walks its way to the previous node in the path up to a root. Along the path, we maintain in set S the intersection of the vertexSet of the nodes in a subset of the visited nodes (those which provide a better partial DSG), while we maintain in set C the labels of the nodes of the selected subset. Note that, at each point, (S ∪ C, S × C) is indeed a valid graph. From all those DSGs, we retain only the “best one”. We determine the “best DSG” using an objective function (f), which is a configuration parameter. We can customize the core of the mining technique based on an inverse traveler function, t, to obtain a promising path P in DAPG, and an objective function, f, to discover dense subgraphs given by Definition 2. This approach is flexible enough to favor given features of dense subgraphs, and allows the exploration of different ideas for determining alternative paths to improve the quality of the results. We consider the inverse traveler and objective functions defined in Table 1. In order to efficiently implement the inverse traveler function Deepest in Table 1, we attach another attribute to each node in DAPG, called maxDepth, which corresponds to the length of the longest path from a root to each node and is defined as follows.

Definition 5 MaxDepth Given a dag DAPG = (N, A), then for every u ∈ N that is not a root, $maxDepth(u) = 1 + \max\{maxDepth(p) : (p, u) \in A\}$, with the roots as the base case.

Finally, the algorithm returns the best DSG it could find starting from node v. We run the algorithm starting at each node u in DAPG, so one DSG is obtained per starting node u. We only collect the maximal DSGs among those (i.e., DSGs that are not subsets of others). All algorithms are presented in S1 File. Fig 1 shows an example of a PPI network represented with a DAPG using the inverse traveler function Deepest, f = UNONE, using total order functions ϕ sorting by ID (C) and by FREQUENCY (D). With this representation, we are able to discover cliques C1 = (1, 2, 3), C2 = (3, 4, 5, 6) and C3 = (4, 5, 6, 7).
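The walk described above can be sketched in a few lines, using the ID order, the Deepest traveler and the UNONE objective. Several details are assumptions made only for this illustration and are not taken from S1 File: each vertex is added to its own adjacency list so that a clique appears as S = C, a parent is kept only when it strictly improves the objective, roots get maxDepth 0, and maximality is checked on the node sets S ∪ C. The example graph is a hypothetical network built as the union of three cliques, not necessarily the one drawn in Fig 1.

```python
from functools import lru_cache

def mine_dsgs(adj):
    """Sketch of the mining heuristic: one DSG per starting node, maximal ones kept."""
    # Assumption: every vertex also appears in its own (ID-sorted) adjacency list.
    lists = {v: sorted(set(nbrs) | {v}) for v, nbrs in adj.items()}
    parents = {u: set() for u in lists}
    vertex_set = {u: set() for u in lists}
    for v, lst in lists.items():
        for u in lst:
            vertex_set[u].add(v)
        for a, b in zip(lst, lst[1:]):
            parents[b].add(a)              # arc (a, b): a is a parent of b

    @lru_cache(maxsize=None)
    def max_depth(u):
        # Longest path from a root; the base value 0 is an assumption.
        return 0 if not parents[u] else 1 + max(max_depth(p) for p in parents[u])

    def unone(S, C):
        return len(S & C)

    found = set()
    for start in lists:
        S, C = set(vertex_set[start]), {start}
        u = start
        while parents[u]:
            u = max(parents[u], key=max_depth)   # Deepest inverse traveler
            S2, C2 = S & vertex_set[u], C | {u}
            if unone(S2, C2) > unone(S, C):      # keep u only if the DSG improves
                S, C = S2, C2
        found.add(frozenset(S | C))
    # Keep only node sets that are not proper subsets of another one.
    return sorted(sorted(d) for d in found if not any(d < o for o in found))

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (3, 6),
         (4, 5), (4, 6), (5, 6), (4, 7), (5, 7), (6, 7)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)
result = mine_dsgs(adj)
print(result)
```

On this toy network the sketch returns the three cliques (1, 2, 3), (3, 4, 5, 6) and (4, 5, 6, 7), including the overlap on nodes 3 to 6.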

Analysis of the algorithms

Let n be the number of nodes in DAPG, h ≤ n be the length of the longest path, and e ≤ n be the maximum number of neighbors of a node. Then, our algorithm starts from each node in DAPG, with an initial vertexSet of size at most e, and walks some path upwards to the root, performing at most h steps. At each step it must compute the inverse traveler function, which in our examples costs O(1) or O(e) time. It also intersects the vertexSet of the new node with the current candidates, in time O(e), and determines whether or not to keep the current node in the set C. All the criteria we use for the latter can be computed in time O(e). Therefore, the total time of this process is O(nhe). Let m be the maximum number of maximal subgraphs produced along the process. Once a new subgraph is produced, we compare it with the O(m) current maximal subgraphs, looking for those that include or are included in the new one, in order to remove the included ones (or the new one). This costs O(nme) time. The total cost is therefore O(ne(h + m)). This is O(n^3) in the worst case, but much less in practice. For example, in Collins we have n = 1,622, e = 127, h = 187, and m = 12, and therefore ne(h + m) is 25,273n, which is about 100 times less than n^3 = 2,630,884n.
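The arithmetic of the Collins example can be checked directly. The short snippet below only recomputes the two per-node factors quoted in the text, e(h + m) and n^2, and their ratio.

```python
n, e, h, m = 1622, 127, 187, 12

per_node = e * (h + m)   # factor multiplying n in O(ne(h + m))
worst = n * n            # factor multiplying n in O(n^3)
ratio = worst / per_node

print(per_node, worst, round(ratio))
```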

Protein complex prediction

We define protein complexes from the DSGs we discover in PPI networks. Since we obtain at most one DSG starting at each node in DAPG, our algorithm is able to obtain overlapping DSGs. Let a parameter minSize define the minimum size of a candidate complex. Then, each DSG(S, C) is considered as a candidate complex with nodes S ∪ C whenever |S ∪ C| ≥ minSize. We generate predicted complexes from candidate complexes based on two different filter options: NONE, where every candidate complex is a predicted complex, and UNION, where a predicted complex is formed by the set union of the pairs of candidate complexes whose overlap score (Eq 1) is greater than a threshold (we used threshold = 0.8).
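The two filter options can be sketched as follows. The text does not say whether UNION merges pairs once or repeatedly, so the loop below, which keeps merging until no pair exceeds the threshold, is an assumption. The default `min_size=3`, the function names, and the usual form of the overlap score, |a ∩ b|² / (|a| · |b|), are also assumptions of this example.

```python
def overlap_score(a, b):
    """Overlap score between two protein sets (assumed standard form)."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))

def predict_complexes(dsgs, min_size=3, mode="NONE", threshold=0.8):
    """Map DSGs (S, C) to predicted complexes with the NONE or UNION filter."""
    cands = [set(S) | set(C) for S, C in dsgs if len(set(S) | set(C)) >= min_size]
    if mode == "NONE":
        return cands
    merged = True
    while merged:                      # assumption: merge until a fixed point
        merged = False
        for i in range(len(cands)):
            for j in range(i + 1, len(cands)):
                if overlap_score(cands[i], cands[j]) > threshold:
                    cands[i] = cands[i] | cands[j]
                    del cands[j]
                    merged = True
                    break
            if merged:
                break
    return cands

a = {1, 2, 3, 4, 5}
b = {1, 2, 3, 4, 5, 6}
dsgs = [(a, a), (b, b), ({8, 9}, {8, 9})]
print(predict_complexes(dsgs, mode="NONE"))
print(predict_complexes(dsgs, mode="UNION"))
```

Here the pair (a, b) has overlap score 25/30 ≈ 0.83, so UNION merges it, while the two-protein DSG is discarded by the size filter in both modes.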

Experimental setup

We implemented the algorithms in C++ and executed all the experiments on a 64-bit Linux machine with 8GB of main memory and a 2.7GHz Intel i7 CPU. All state-of-the-art methods were also executed on the same machine, except COREPEEL, which is provided through its web site. We used yeast (Saccharomyces cerevisiae) and human (Homo sapiens) PPI networks for experimental evaluation. Specifically, we used the following yeast PPI networks: Collins [21], Krogan core and Krogan extended [22], Gavin [23], DIP-yeast (available in [18]) and BioGrid (version 3.4.138) for yeast (available at http://thebiogrid.org). We used human PPI networks Biogrid (version 3.4.138) and HPRD [36]. We compared our complex prediction results against the up-to-date complex yeast reference CYC2008 [27], SGD (available at http://www.yeastgenome.org), and MIPS (obtained from the ClusterONE distribution [11]). For human proteins we used PCDq [34] and CORUM [35]. Table 2 shows the main statistics of the PPI networks we used and Table 3 displays the number of complexes of each reference plus the number of complexes obtained by merging them. Since performing an exact merging of gold standards might be difficult, we approximate the merge procedure as follows: If the same protein complex name is found, then the merged version contains only one copy. If the protein complex names are different and the complexes contain the same proteins, then the merged version also contains one copy. If both the complex name and the proteins are different, then the merged reference contains both complexes.
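The approximate merge of gold standards described above can be sketched as follows. Each reference is assumed to be a mapping from complex name to its set of proteins, and when two copies collide the first one seen is kept; both the data layout and that tie rule are assumptions of this example.

```python
def merge_references(*refs):
    """Merge references given as dicts {complex_name: set_of_proteins}."""
    merged = {}
    seen_members = set()
    for ref in refs:
        for name, proteins in ref.items():
            members = frozenset(proteins)
            # Same name, or different name with the same proteins: keep one copy.
            if name in merged or members in seen_members:
                continue
            merged[name] = members
            seen_members.add(members)
    return merged

ref_a = {"A": {1, 2}, "B": {3, 4}}
ref_b = {"A": {1, 2, 5}, "C": {3, 4}, "D": {6, 7}}
merged = merge_references(ref_a, ref_b)
print(sorted(merged))
```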
Table 2

Main statistics of PPI networks.

Network | Proteins | Interactions | Avg degree
Saccharomyces cerevisiae (yeast)
Collins | 1,622 | 9,074 | 5.59
Krogan core | 2,708 | 7,123 | 2.63
Krogan extended | 3,672 | 14,317 | 3.89
Gavin | 1,855 | 7,669 | 4.13
DIP-yeast | 4,638 | 21,377 | 4.60
Biogrid yeast | 6,436 | 229,409 | 35.64
Homo sapiens (human)
HPRD | 9,453 | 36,867 | 3.90
Biogrid human | 17,545 | 233,688 | 13.31
Table 3

Main statistics of protein complex references.

Name | Complexes | URL
Saccharomyces cerevisiae
CYC2008 | 408 | http://wodaklab.org/CYC2008/
SGD | 372 | http://www.yeastgenome.org/download-data/curation
MIPS | 203 | http://www.paccanarolab.org/clusterone/
CYC2008, SGD | 582 | Built
CYC2008, SGD, MIPS | 614 | Built
Homo sapiens
CORUM | 1,679 | http://mips.helmholtz-muenchen.de/genre/proj/corum/
PCDq | 1,263 | http://h-invitational.jp/hinv/pcdq/
CORUM, PCDq | 2,881 | Built
For biological metrics, we also used current state-of-the-art gene ontology and annotations, available at http://www.geneontology.org. We considered state-of-the-art complex prediction methods such as ClusterONE [11], MCL [12], CFinder [10], GMFTP [26], MCODE [14], RNSC [15], SPICI [16], DCAFP [17] and COREPEEL [18]. For each method we used the parameters that provided the best results. To evaluate the effectiveness of our clustering approach we considered clustering and biological metrics. Clustering metrics measure the quality of the complexes in terms of how well the predicted complexes are related to the reference complexes. Biological metrics assess the probability that proteins in predicted complexes form real complexes (given by a reference) based on the relationship among the proteins in terms of their localization and annotations. Proposed methods usually measure the degree of matching between a predicted and a real complex [19]. This metric is usually called Overlap Score (OS) or Network Affinity (NA). If pc is the set of vertices forming a predicted complex and rc the set of vertices forming a complex in the reference, we have Eq 1 for OS:

$$OS(pc, rc) = \frac{|pc \cap rc|^2}{|pc| \times |rc|} \quad (1)$$

Many research works declare a match between a predicted and a reference complex when OS ≥ w (generally w = 0.2 or 0.25 [19]). We used three clustering evaluation metrics usually found in complex prediction evaluations: FMeasure, Accuracy (Acc) and Maximum Matching Ratio (MMR). FMeasure is defined in terms of Precision and Recall, which depend on the definition of True Positives (TP), False Positives (FP) and False Negatives (FN). TP is the number of predicted complexes with an OS over a threshold value for some reference complex, and FP is the total number of predicted complexes minus TP. FN is the number of complexes known in the reference that are not matched by any predicted complex. 
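The matching rule and the resulting counts can be sketched as follows, assuming the usual form of the overlap score, |pc ∩ rc|² / (|pc| · |rc|), and the standard definitions of precision, recall and their harmonic mean. The function names and the toy complexes are hypothetical; this is not the evaluation script used in the paper.

```python
def overlap_score(pc, rc):
    """Overlap score between a predicted and a reference complex."""
    inter = len(pc & rc)
    return inter ** 2 / (len(pc) * len(rc))

def match_counts(predicted, reference, w=0.2):
    """Return (TP, FP, FN) for a matching threshold w."""
    tp = sum(any(overlap_score(p, r) >= w for r in reference) for p in predicted)
    fp = len(predicted) - tp
    fn = sum(not any(overlap_score(p, r) >= w for p in predicted) for r in reference)
    return tp, fp, fn

def f_measure(tp, fp, fn):
    """Return (precision, recall, F) from the match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return precision, recall, (2 * precision * recall / total if total else 0.0)

predicted = [{1, 2, 3}, {7, 8}]
reference = [{1, 2, 3, 4}, {5, 6}]
counts = match_counts(predicted, reference)
print(counts, f_measure(*counts))
```

In this toy case the first predicted complex matches the first reference complex with OS = 9/12 = 0.75, while the second prediction and the second reference complex stay unmatched.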
Precision and Recall are metrics that measure, respectively, how many predicted complexes are correct with respect to the total number of predicted complexes, and how many reference complexes are correctly predicted. Eq 2 gives their formulas. It also gives the formula for FMeasure, which is the harmonic mean of Precision and Recall and is used, among other metrics, to measure the overall performance of clustering algorithms.

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}, \quad FMeasure = \frac{2 \times Precision \times Recall}{Precision + Recall} \quad (2)$$

Acc is the geometric mean of Sensitivity S and Positive Predictive Value PPV. S shows how well the proteins in the reference complexes are identified in terms of coverage, and PPV indicates the probability that the predicted complexes are TP. Eq 3 displays the equations for S, PPV, and Acc.

$$S = \frac{\sum_{i=1}^{n} \max_{j} T_{ij}}{\sum_{i=1}^{n} N_i}, \quad PPV = \frac{\sum_{j=1}^{m} \max_{i} T_{ij}}{\sum_{j=1}^{m} T_{\cdot j}}, \quad Acc = \sqrt{S \times PPV} \quad (3)$$

Here $T_{ij}$ is the number of proteins in common between the i-th reference complex and the j-th predicted complex; n is the number of complexes in the reference and m the number of predicted complexes; $N_i$ is the number of proteins in the i-th reference complex, and $T_{\cdot j} = \sum_{i=1}^{n} T_{ij}$. Since several research works use FMeasure and Acc as clustering evaluation metrics, we included them as well. However, they are not free of problems. For instance, Acc penalizes predicted complexes that do not match any of the reference complexes, when some of the predicted complexes might indeed be undiscovered complexes. We also used the MMR measure, introduced by Nepusz et al. [11] to avoid the penalization of accuracy metrics over clusters with significant overlaps. MMR is based on a maximal one-to-one mapping between predicted and reference complexes. MMR is computed on a bipartite graph where one set of nodes is formed by the predicted complexes and the other by the reference complexes. Each edge has a weight representing the overlap score between the two vertices. The maximum weighted bipartite matching on this graph measures the quality of predicted complexes with respect to the reference complexes. 
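As a sketch of the Acc computation (Eq 3), the function below builds the contingency matrix T between reference and predicted complexes and derives S, PPV and their geometric mean, following the common definition of these measures. It assumes that at least one reference and one predicted complex share a protein, so that no denominator is zero; the function name and the toy data are hypothetical.

```python
from math import sqrt

def accuracy(predicted, reference):
    """Return (S, PPV, Acc) for lists of protein sets."""
    # T[i][j]: proteins shared by reference complex i and predicted complex j.
    T = [[len(r & p) for p in predicted] for r in reference]
    sn = sum(max(row) for row in T) / sum(len(r) for r in reference)
    cols = list(zip(*T))                 # one column per predicted complex
    ppv = sum(max(col) for col in cols) / sum(sum(col) for col in cols)
    return sn, ppv, sqrt(sn * ppv)

reference = [{1, 2, 3, 4}, {5, 6}]
predicted = [{1, 2, 3, 5}, {5, 6, 7}]
sn, ppv, acc = accuracy(predicted, reference)
print(sn, ppv, acc)
```

For these sets T = [[3, 0], [1, 2]], so both S and PPV equal 5/6 and Acc is 5/6 as well.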
The MMR score is given by the sum of the weights of the edges selected by this matching divided by the number of reference complexes. MMR offers a good comparison between predicted and reference complexes, penalizing those cases when reference complexes are found in two predicted complexes with high overlap. In order to compute the MMR (Eq 4), ClusterONE first matches each reference complex (rc) to a predicted complex (pc) so that the average OS over all reference complexes is maximized (considering a minimum OS ≥ 0.2).

\[ MMR = \frac{1}{n} \sum_{(rc_{i},\, pc_{j}) \in M} OS(pc_{j}, rc_{i}) \tag{4} \]

where M is the selected one-to-one matching and n is the number of reference complexes.

One important feature of PPI networks is that they are incomplete and noisy. Biological processes for discovering protein interactions are not error free. In consequence, PPI networks might miss proteins with their interactions or include interactions that are not real. Algorithms should consider this feature to improve mining results [19]. This fact can be observed by looking at the proteins that are in PPI networks and the proteins that are in the reference. Nepusz et al. [11] consider the following three cases for proteins in PPI networks and the reference:

(1) Proteins appearing in the PPI and in the reference.
(2) Proteins appearing in the PPI, but not in the reference.
(3) Proteins appearing in the reference, but not in the PPI.

Evaluating mining algorithms for cases (1) and (2) is straightforward since the protein interactions can be captured by the mining algorithm. Complexes found in case (2) might be due to mistakes of the mining algorithm or to incompleteness of the reference; therefore this case might require an analysis of the false positives generated by the mining algorithm. However, finding complexes in case (3) is impossible for any mining algorithm based on clustering. A simple way to evaluate a mining algorithm would be to ignore reference complexes containing proteins unknown in the PPI, but if these proteins are missing from large complexes there might not be a good reason to eliminate the complete complex.
Based on these considerations Nepusz et al. [11] propose filtering the references for evaluating a mining algorithm. The procedure is as follows:

1. Identify all proteins that have at least one known interaction with other proteins in the input PPI.
2. For each complex in the reference, identify its proteins and compute the set intersection with all proteins in the input PPI.
3. If the set intersection size of a reference complex in the previous step is less than half of the size of the complete reference complex, the reference complex is eliminated, because too many proteins are missing in the input PPI; even if this complex is predicted, it might not be because of the quality of the algorithm.
4. If the set intersection size of a reference complex is greater than half of the size of the complete reference complex, the reference complex is kept but all proteins that are unknown to the PPI are removed from it. This action does not favor any mining algorithm, since all algorithms are assessed on the same reference and those proteins could not be inferred anyway.

In order to provide a fair way to compare our approach against other proposed methods, we used the implementation just described [11], available at https://github.com/jboscolo/RH/find/master. This implementation includes the computation of FMeasure, Acc and MMR.
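The filtering procedure above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation referenced in the text: the function name `filter_reference` is hypothetical, the PPI is assumed to be a list of protein pairs, and a complex with exactly half of its proteins known is kept here, which is an assumption because the description only covers the "less than half" and "greater than half" cases.

```python
def filter_reference(reference, ppi_edges):
    """Filter reference complexes against the proteins known to the PPI."""
    # Step 1: proteins with at least one interaction with another protein.
    known = set()
    for u, v in ppi_edges:
        if u != v:
            known.update((u, v))

    filtered = []
    for rc in reference:
        rc = set(rc)
        # Step 2: proteins of the complex that appear in the PPI.
        kept = rc & known
        # Step 3: drop the complex when fewer than half of its proteins are known.
        if 2 * len(kept) < len(rc):
            continue
        # Step 4: keep the complex, restricted to its known proteins.
        filtered.append(kept)
    return filtered


# Toy data, made up for illustration only.
ppi = [("A", "B"), ("B", "C")]
reference = [{"A", "B", "Z"}, {"A", "X", "Y", "W"}, {"C", "Q"}]
print(filter_reference(reference, ppi))
```

The second toy complex is dropped because only one of its four proteins is present in the PPI.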

Biological measures

Besides clustering measures, we consider biological relevance metrics. In this context we used Colocalization and Gene Ontology Similarity (GoSim). Colocalization measures the relationship of proteins based on where they are located in the cell and organism. The idea is that since protein complexes are assembled to perform a specific function, proteins within the same complex tend to be close to each other [37]. The idea of GoSim comes from the Gene Ontology Annotations, which describe the functions in which proteins participate. Since protein complexes are formed to perform specific functions, proteins forming a complex tend to share similar functionality [38]. We used the software ProCope to measure Colocalization and GoSim. ProCope is available at https://www.bio.ifi.lmu.de/software/procope/index.html [39].

We also include a measure of the biological significance of predicted protein complexes based on enrichment analysis. To compute the biological significance of predicted complexes we use the method described in [40], taking into account the p-values of predicted complexes, which represent the probability of co-occurrence of proteins with common functions. As in [40], we used BINGO [41], a Cytoscape [42] plugin that uses a hypergeometric test to determine which GO categories are statistically overrepresented in a set of genes. A low p-value for the set of genes in a predicted complex indicates that those proteins are statistically relevant in the complex. Typically, a predicted complex with a p-value < 0.01 is considered significant. We report the proportion of significant complexes (SC).
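To make the enrichment idea concrete, the following Python sketch computes the upper tail of a hypergeometric distribution for one GO category and the fraction of complexes below a significance threshold. It is only an illustration of the test named in the text, not BINGO itself: the function names and arguments are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """P(X >= annotated_in_sample) under a hypergeometric distribution.

    population: number of genes in the background set
    annotated: background genes carrying the GO category
    sample: number of genes in the predicted complex
    annotated_in_sample: genes of the complex carrying the GO category
    """
    upper = min(sample, annotated)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(annotated_in_sample, upper + 1)
    )
    return tail / total


def significant_fraction(p_values, alpha=0.01):
    """Fraction of complexes whose p-value is below alpha."""
    if not p_values:
        return 0.0
    return sum(p < alpha for p in p_values) / len(p_values)


# Toy numbers, made up for illustration only.
print(enrichment_p_value(population=10, annotated=4, sample=3, annotated_in_sample=2))
print(significant_fraction([0.001, 0.2, 0.009, 0.5]))
```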

Clustering performance results

As mentioned in previous sections, we considered clustering metrics used by other clustering strategies, namely FMeasure, Accuracy (Acc) and Maximum Matching Ratio (MMR). Specifically, we used the ClusterONE implementation of the Acc and MMR metrics and added support for FMeasure, so that all the clustering techniques considered are evaluated in the same way. The ClusterONE implementation eliminates reference complexes in which more than 50% of the proteins are unknown (i.e., absent from the PPI network) and removes the unknown proteins from complexes in which less than 50% of the proteins are unknown.
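For readers who want to see how Acc and MMR behave on small inputs, here is a minimal Python sketch. It does not reproduce the ClusterONE implementation: the function names are hypothetical, complexes are assumed to be sets of protein names, and the maximum weighted bipartite matching is found by exhaustive search over one-to-one assignments, which is only practical for toy inputs.

```python
from itertools import permutations
from math import sqrt


def overlap_score(pc, rc):
    """Overlap score between two protein sets."""
    pc, rc = set(pc), set(rc)
    if not pc or not rc:
        return 0.0
    common = len(pc & rc)
    return common * common / (len(pc) * len(rc))


def accuracy(predicted, reference):
    """Geometric mean of sensitivity and positive predictive value."""
    if not predicted or not reference:
        return 0.0
    # t[i][j]: proteins shared by reference complex i and predicted complex j.
    t = [[len(set(r) & set(p)) for p in predicted] for r in reference]
    total_reference = sum(len(set(r)) for r in reference)
    total_shared = sum(sum(row) for row in t)
    if total_reference == 0 or total_shared == 0:
        return 0.0
    sensitivity = sum(max(row) for row in t) / total_reference
    ppv = sum(max(col) for col in zip(*t)) / total_shared
    return sqrt(sensitivity * ppv)


def mmr_exhaustive(predicted, reference, min_os=0.2):
    """Maximum matching ratio via exhaustive search (toy inputs only)."""
    if not predicted or not reference:
        return 0.0
    # Edge weights; pairs below the minimum overlap contribute nothing.
    weights = [[overlap_score(p, r) for p in predicted] for r in reference]
    weights = [[x if x >= min_os else 0.0 for x in row] for row in weights]
    if len(weights) > len(weights[0]):
        # Transpose so that rows correspond to the smaller side.
        weights = [list(col) for col in zip(*weights)]
    rows, cols = len(weights), len(weights[0])
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(cols), rows)
    )
    return best / len(reference)


# Toy data, made up for illustration only.
reference = [{"A", "B", "C", "D"}, {"E", "F"}]
predicted = [{"A", "B", "C"}, {"E", "F", "G"}]
print(accuracy(predicted, reference))
print(mmr_exhaustive(predicted, reference))
```

A production implementation would replace the exhaustive search with a polynomial-time assignment algorithm; the search is used here only to keep the sketch short and dependency free.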

Parameter tuning

First, we define different node ordering algorithms to map the protein names to unique numeric ids in the graph. We consider the node ordering algorithms already described: First, Random, Degree, Lexicographic, BFS, and DFS. We compared our results according to the different parameters of our algorithms. Table 4 summarizes the main parameters of our approach. With Protein Mapping we specify the text file describing the mapping from proteins to numeric ids. With Graph Type we specify the type of graph, which can be undirected unweighted, UNONE, or undirected weighted, USYM. With Alternative fobj, we choose an objective function f based on weighted density, which the mining algorithm uses to detect the best dense subgraphs (the default function, f = |S ∩ C|, is used with option UNONE). With Sorting we specify how adjacency lists are sorted; it can be by ID or by FREQUENCY. Finally, Grouping allows us to define how predicted complexes are built from candidate complexes. The alternatives are UNION, which takes the union Cx ∪ Cy of candidate complexes Cx and Cy where OS(Cx, Cy) > 0.8, and NONE, where predicted complexes are defined as the candidate complexes. Other parameters include the minimum size, minSize, of any complex, the type of dense subgraph (only cliques or dense subgraphs) and an alternative mapping for input PPI networks.
Table 4

Parameter settings.

| Parameter | Option | Description |
|---|---|---|
| Protein mapping (-m) | mappingFile | File mapping protein names to numeric ids |
| Sorting (-r) | FREQUENCY | Sorting of adjacency list by frequency before building DAPG |
| | ID | Sorting by id in adjacency list before building DAPG |
| Grouping (-f): predicted protein complex formation (PC) using OS(Cx, Cy) > 0.8 | UNION | PC = Cx ∪ Cy |
| | NONE | Cx and Cy |
| Graph Types (-g) | UNONE | Undirected-unweighted graph |
| | USYM | Undirected-weighted graph |
| Alternative fobj (-w) | WEDGE | Select the dense subgraphs with higher weighted-edge-density |
| | WDEGREE | Select the dense subgraphs with higher weighted-degree-density |
| | FWEDGREE | Select the dense subgraphs with higher weighted-edge-density of SC induced subgraph. |
| | FWEDGE | Select the dense subgraphs with higher weighted-degree-density of SC induced subgraph. |
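The UNION grouping option listed in Table 4 can be illustrated with the following Python sketch. It is an assumption-laden illustration rather than the DAPG code: the function name `group_union` is hypothetical, and the source does not specify the order in which pairs are merged, so the sketch simply repeats pairwise merging until no pair of complexes has OS > 0.8.

```python
def overlap_score(a, b):
    """Overlap score between two protein sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    common = len(a & b)
    return common * common / (len(a) * len(b))


def group_union(candidates, threshold=0.8):
    """Merge candidate complexes whose overlap score exceeds the threshold."""
    complexes = [set(c) for c in candidates]
    merged = True
    while merged:
        merged = False
        for i in range(len(complexes)):
            for j in range(i + 1, len(complexes)):
                if overlap_score(complexes[i], complexes[j]) > threshold:
                    # Replace the pair by its union and rescan from the start.
                    complexes[i] |= complexes[j]
                    del complexes[j]
                    merged = True
                    break
            if merged:
                break
    return complexes


# Toy data, made up for illustration only.
candidates = [set("ABCDEFGHI"), set("ABCDEFGHIJ"), set("XYZ")]
print(group_union(candidates))
```

With the NONE option no merging takes place and the candidate complexes are reported as they are.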
In order to compare our results we tried different node ordering (protein mapping) algorithms and different parameters in each experiment, named in the following format: DAPGGTypeDM-rSorting-fGrouping (Protein Mapping). In this format GType can be UU (undirected unweighted) or UW (undirected weighted); DM can be any of the density measures; Sorting indicates adjacency lists sorted by frequency (F) or ID (I); and Grouping is the way we group candidate complexes to generate predicted complexes, defined by the union set (U) or none (N). Tables 5 and 6 show the performance of our algorithm with different node ordering algorithms (protein name to numeric id mapping) and total order function ϕ (ID, FREQUENCY). We observe that BFS and DFS traversals provide the best results in seven of the eight PPI networks we tested. The total order function Sorting by ID is also very effective with these protein mappings, achieving the best results in six of the eight PPI networks.
Table 5

Results of best clustering metrics (with CYC2008 gold standard) obtained with DAPG (with complexes of minimum size 3) using different node ordering algorithms and applying sorting (ϕ function) in small PPIs.

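| Network | Node ordering | Sorting | Complexes | FMeasure | Acc | MMR |
|---|---|---|---|---|---|---|
| Collins | First | FREQUENCY | 620 | 0.7269 | 0.7226 | 0.7020 |
| | | ID | 447 | 0.6782 | 0.7115 | 0.6749 |
| | Lexicographic | FREQUENCY | 623 | 0.7341 | 0.7259 | 0.7043 |
| | | ID | 410 | 0.6983 | 0.7133 | 0.6469 |
| | Random | FREQUENCY | 626 | 0.7466 | 0.7225 | 0.7141 |
| | | ID | 400 | 0.6517 | 0.7091 | 0.5986 |
| | Degree | FREQUENCY | 623 | 0.7280 | 0.7218 | 0.7036 |
| | | ID | 484 | 0.6782 | 0.7160 | 0.6870 |
| | BFS | FREQUENCY | 633 | 0.7248 | 0.7234 | 0.7183 |
| | | ID | 495 | 0.6578 | 0.7120 | 0.6739 |
| | DFS | FREQUENCY | 618 | 0.7289 | 0.7182 | 0.6999 |
| | | ID | 509 | 0.6641 | 0.7106 | 0.6791 |
| Krogan Core | First | FREQUENCY | 651 | 0.6448 | 0.6178 | 0.4699 |
| | | ID | 558 | 0.6191 | 0.6426 | 0.4814 |
| | Lexicographic | FREQUENCY | 627 | 0.6400 | 0.6391 | 0.4582 |
| | | ID | 472 | 0.6027 | 0.6223 | 0.4321 |
| | Random | FREQUENCY | 627 | 0.6373 | 0.6199 | 0.4391 |
| | | ID | 403 | 0.6030 | 0.5947 | 0.3863 |
| | Degree | FREQUENCY | 636 | 0.6516 | 0.6146 | 0.4688 |
| | | ID | 564 | 0.6023 | 0.6060 | 0.4577 |
| | BFS | FREQUENCY | 614 | 0.6388 | 0.6279 | 0.4562 |
| | | ID | 658 | 0.5784 | 0.6143 | 0.4991 |
| | DFS | FREQUENCY | 627 | 0.6353 | 0.6345 | 0.4556 |
| | | ID | 649 | 0.6782 | 0.6242 | 0.5059 |
| Krogan Extended | First | FREQUENCY | 960 | 0.5142 | 0.6152 | 0.4226 |
| | | ID | 864 | 0.4851 | 0.6248 | 0.4489 |
| | Lexicographic | FREQUENCY | 969 | 0.5294 | 0.6337 | 0.4321 |
| | | ID | 732 | 0.4876 | 0.6120 | 0.4108 |
| | Random | FREQUENCY | 943 | 0.5250 | 0.6273 | 0.4328 |
| | | ID | 809 | 0.4007 | 0.5816 | 0.3163 |
| | Degree | FREQUENCY | 947 | 0.5180 | 0.6172 | 0.4274 |
| | | ID | 895 | 0.4720 | 0.6152 | 0.4212 |
| | BFS | FREQUENCY | 943 | 0.5303 | 0.6284 | 0.4217 |
| | | ID | 970 | 0.4710 | 0.5947 | 0.4100 |
| | DFS | FREQUENCY | 967 | 0.5244 | 0.6232 | 0.4188 |
| | | ID | 830 | 0.5411 | 0.6226 | 0.4724 |
| Gavin | First | FREQUENCY | 611 | 0.6516 | 0.7083 | 0.5809 |
| | | ID | 641 | 0.5752 | 0.7055 | 0.5838 |
| | Lexicographic | FREQUENCY | 626 | 0.6491 | 0.7061 | 0.5827 |
| | | ID | 503 | 0.6013 | 0.7028 | 0.5446 |
| | Random | FREQUENCY | 667 | 0.6441 | 0.7110 | 0.5908 |
| | | ID | 474 | 0.5884 | 0.6901 | 0.5270 |
| | Degree | FREQUENCY | 612 | 0.6509 | 0.7089 | 0.5840 |
| | | ID | 529 | 0.6097 | 0.6936 | 0.5592 |
| | BFS | FREQUENCY | 621 | 0.6454 | 0.7172 | 0.5819 |
| | | ID | 715 | 0.6164 | 0.7135 | 0.6079 |
| | DFS | FREQUENCY | 620 | 0.6589 | 0.7148 | 0.5975 |
| | | ID | 723 | 0.5500 | 0.6990 | 0.6006 |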
Table 6

Results of best clustering metrics (with CYC2008 and CORUM references) obtained with DAPG (with complexes of minimum size 3) using different node ordering algorithms and applying sorting (ϕ function) in large PPIs.

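| Network | Node ordering | Sorting | Complexes | FMeasure | Acc | MMR |
|---|---|---|---|---|---|---|
| DIP-yeast | First | FREQUENCY | 1,217 | 0.4000 | 0.5520 | 0.3615 |
| | | ID | 1,141 | 0.3942 | 0.5416 | 0.3815 |
| | Lexicographic | FREQUENCY | 1,199 | 0.3872 | 0.5355 | 0.3550 |
| | | ID | 1,085 | 0.4085 | 0.5565 | 0.3610 |
| | Random | FREQUENCY | 1,142 | 0.4070 | 0.5364 | 0.3491 |
| | | ID | 909 | 0.3438 | 0.4808 | 0.2535 |
| | Degree | FREQUENCY | 1,212 | 0.3961 | 0.5489 | 0.3682 |
| | | ID | 1,165 | 0.3835 | 0.5393 | 0.3560 |
| | BFS | FREQUENCY | 1,253 | 0.4197 | 0.5674 | 0.3751 |
| | | ID | 1,242 | 0.3622 | 0.5551 | 0.3718 |
| | DFS | FREQUENCY | 1,210 | 0.4110 | 0.5450 | 0.3671 |
| | | ID | 1,925 | 0.3830 | 0.5486 | 0.4447 |
| Biogrid-yeast | First | FREQUENCY | 5,025 | 0.1551 | 0.5691 | 0.3534 |
| | | ID | 4,945 | 0.1444 | 0.5693 | 0.3371 |
| | Lexicographic | FREQUENCY | 4,999 | 0.1561 | 0.5727 | 0.3687 |
| | | ID | 4,991 | 0.1740 | 0.5967 | 0.3845 |
| | Random | FREQUENCY | 5,017 | 0.1548 | 0.5718 | 0.3599 |
| | | ID | 5,167 | 0.1108 | 0.5368 | 0.2614 |
| | Degree | FREQUENCY | 5,049 | 0.1533 | 0.5667 | 0.3439 |
| | | ID | 5,004 | 0.1465 | 0.5677 | 0.3432 |
| | BFS | FREQUENCY | 4,977 | 0.1584 | 0.5741 | 0.3650 |
| | | ID | 5,254 | 0.1047 | 0.5355 | 0.2711 |
| | DFS | FREQUENCY | 5,009 | 0.1570 | 0.5720 | 0.3627 |
| | | ID | 4,950 | 0.1446 | 0.5800 | 0.3468 |
| HPRD | First | FREQUENCY | 2,437 | 0.3395 | 0.2140 | 0.1713 |
| | | ID | 2,442 | 0.3200 | 0.2272 | 0.1743 |
| | Lexicographic | FREQUENCY | 2,430 | 0.3528 | 0.2103 | 0.1783 |
| | | ID | 2,085 | 0.3542 | 0.2099 | 0.1643 |
| | Random | FREQUENCY | 2,430 | 0.3465 | 0.2121 | 0.1688 |
| | | ID | 1,977 | 0.3464 | 0.1879 | 0.1326 |
| | Degree | FREQUENCY | 2,449 | 0.3401 | 0.2135 | 0.1706 |
| | | ID | 2,412 | 0.3354 | 0.2127 | 0.1675 |
| | BFS | FREQUENCY | 2,441 | 0.3584 | 0.2139 | 0.1865 |
| | | ID | 2,777 | 0.3685 | 0.2119 | 0.2066 |
| | DFS | FREQUENCY | 2,443 | 0.3484 | 0.2105 | 0.1668 |
| | | ID | 2,313 | 0.3392 | 0.2340 | 0.1862 |
| Biogrid-human | First | FREQUENCY | 7,360 | 0.2380 | 0.2924 | 0.2387 |
| | | ID | 7,200 | 0.2349 | 0.2825 | 0.2372 |
| | Lexicographic | FREQUENCY | 7,394 | 0.2474 | 0.2920 | 0.2405 |
| | | ID | 7,313 | 0.2507 | 0.2738 | 0.2385 |
| | Random | FREQUENCY | 7,316 | 0.2492 | 0.2907 | 0.2332 |
| | | ID | 7,663 | 0.2587 | 0.2732 | 0.2227 |
| | Degree | FREQUENCY | 7,375 | 0.2412 | 0.2920 | 0.2418 |
| | | ID | 7,352 | 0.2352 | 0.2918 | 0.2374 |
| | BFS | FREQUENCY | 7,152 | 0.2453 | 0.2902 | 0.2354 |
| | | ID | 8,144 | 0.2204 | 0.2854 | 0.2232 |
| | DFS | FREQUENCY | 7,409 | 0.2527 | 0.2917 | 0.2539 |
| | | ID | 6,498 | 0.2309 | 0.2877 | 0.2228 |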
We also explore the impact of adding random edges to the PPI networks. We present these results in Table 7. The clustering metrics change only slightly when 5% or 10% of random edges are added, which indicates that our scheme is robust to this kind of noise.
Table 7

Adding random interactions in yeast and human PPI networks (with CYC2008 and CORUM references) obtained with DAPG (with complexes of minimum size 3).

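| Network | Edges increased (%) | Complexes | FMeasure | Acc | MMR |
|---|---|---|---|---|---|
| Collins | 5 | 522 | 0.7195 | 0.7102 | 0.6619 |
| | 10 | 501 | 0.7041 | 0.7270 | 0.6447 |
| Krogan Core | 5 | 611 | 0.6605 | 0.6165 | 0.4844 |
| | 10 | 591 | 0.6574 | 0.6290 | 0.4908 |
| Krogan Extended | 5 | 790 | 0.5287 | 0.6128 | 0.4430 |
| | 10 | 740 | 0.5506 | 0.6177 | 0.4410 |
| Gavin | 5 | 681 | 0.5996 | 0.7095 | 0.5879 |
| | 10 | 664 | 0.6072 | 0.7185 | 0.5733 |
| DIP-yeast | 5 | 1,989 | 0.3852 | 0.5471 | 0.4476 |
| | 10 | 2,011 | 0.3820 | 0.5499 | 0.4499 |
| Biogrid-yeast | 5 | 4,971 | 0.1686 | 0.5956 | 0.3787 |
| | 10 | 4,966 | 0.1615 | 0.5963 | 0.3737 |
| HPRD | 5 | 2,692 | 0.3582 | 0.2191 | 0.2000 |
| | 10 | 2,167 | 0.3462 | 0.2153 | 0.1897 |
| Biogrid-human | 5 | 7,047 | 0.2402 | 0.2998 | 0.2392 |
| | 10 | 6,857 | 0.2373 | 0.2925 | 0.2297 |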
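The perturbation behind Table 7 can be sketched as follows. The text does not describe how the random interactions were generated, so this Python sketch rests on stated assumptions: new edges are drawn uniformly among pairs of existing proteins that are not already connected, the network is unweighted, and the function name `add_random_edges` is hypothetical.

```python
import random


def add_random_edges(edges, percent, seed=0):
    """Return the edge list extended with `percent`% random new edges."""
    rng = random.Random(seed)
    # Undirected edges without self-loops, stored as frozensets.
    existing = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = sorted({v for e in existing for v in e})
    target = round(len(existing) * percent / 100)
    # Never ask for more new edges than the graph can still hold.
    free_pairs = len(nodes) * (len(nodes) - 1) // 2 - len(existing)
    target = min(target, free_pairs)

    added = set()
    while len(added) < target:
        u, v = rng.sample(nodes, 2)
        candidate = frozenset((u, v))
        if candidate not in existing and candidate not in added:
            added.add(candidate)
    return [tuple(sorted(e)) for e in existing] + [tuple(sorted(e)) for e in added]


# Toy network, made up for illustration only: a path with 20 edges.
ppi = [(i, i + 1) for i in range(20)]
noisy = add_random_edges(ppi, percent=10)
print(len(ppi), len(noisy))
```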
We show our best results in Table 8 using all gold standards. We obtain our best results using the objective function f = |S ∪ C|; only in DIP-yeast is the weighted degree density (WDEGREE) better. We also obtain our best results without merging or combining dense subgraphs, which corresponds to the grouping option NONE described in Table 4.
Table 8

Our best results of clustering metrics obtained with DAPG (with complexes of minimum size 3).

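| Network | Algorithm | Complexes | Reference | FMeasure | Acc | MMR |
|---|---|---|---|---|---|---|
| Collins | DAPGU(BFS) rFfN | 633 | CYC2008 | 0.7248 | 0.7234 | 0.7183 |
| | | | SGD | 0.6037 | 0.5409 | 0.5956 |
| | | | MIPS | 0.5449 | 0.5417 | 0.4956 |
| Krogan Core | DAPGU(DFS) rIfN | 649 | CYC2008 | 0.6782 | 0.6242 | 0.5059 |
| | | | SGD | 0.6266 | 0.4519 | 0.4153 |
| | | | MIPS | 0.4612 | 0.3793 | 0.3085 |
| Krogan Extended | DAPGU(DFS) rIfN | 830 | CYC2008 | 0.5411 | 0.6226 | 0.4724 |
| | | | SGD | 0.4836 | 0.4400 | 0.3662 |
| | | | MIPS | 0.3724 | 0.3679 | 0.2747 |
| Gavin | DAPGU(BFS) rIfN | 715 | CYC2008 | 0.6164 | 0.7135 | 0.6079 |
| | | | SGD | 0.5188 | 0.5270 | 0.4956 |
| | | | MIPS | 0.4376 | 0.4827 | 0.4304 |
| DIP-yeast | DAPGUWD(DFS) rIfN | 1,925 | CYC2008 | 0.3830 | 0.5486 | 0.4447 |
| | | | SGD | 0.3473 | 0.4008 | 0.3620 |
| | | | MIPS | 0.2992 | 0.3475 | 0.3607 |
| Biogrid-yeast | DAPGU(Lex) rIfN | 4,991 | CYC2008 | 0.1740 | 0.5967 | 0.3845 |
| | | | SGD | 0.1671 | 0.4627 | 0.3737 |
| | | | MIPS | 0.1292 | 0.3925 | 0.2994 |
| HPRD | DAPGU(BFS) rIfN | 2,777 | CORUM | 0.3685 | 0.2119 | 0.2066 |
| | | | PCDq | 0.3431 | 0.2992 | 0.1681 |
| Biogrid-human | DAPGU(DFS) rFfN | 7,409 | CORUM | 0.2527 | 0.2917 | 0.2539 |
| | | | PCDq | 0.1599 | 0.3495 | 0.1272 |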

Results

In this section we compare our best results with state-of-the-art techniques such as ClusterONE [11], MCL [12], CFinder [10], GMFTP [26], MCODE [14], RNSC [15], SPICI [16], DCAFP [17], COREPEEL [18] and DSDCluster (winner of subchallenge 1 of the Disease Module Identification DREAM Challenge, https://www.synapse.org/#!Synapse:syn6156761/discussion/threadId=1073). For each method we used the parameters that provided the best results. In the case of GMFTP we used the default parameters (τ = 0.2, K = 1000, λ = 4, T = 400, ρ = 1e−6) and set repeat_times = 10 instead of the default, which was 100. With this change we could get results in a little more than 12 hours for each PPI network. For CFinder the most sensitive parameter is t, the time allowed for clique search per node. We used t = 1 and t = 10 and took the best result. Since GMFTP took too much execution time for small PPI networks (over 12 hours) we did not try to run it with larger PPIs. Also, we were unable to execute CFinder with the two largest PPI networks, and DCAFP failed with a memory error on Biogrid-human; therefore we do not report results for these cases.

The main parameter for executing DSDCluster is the number of clusters (K). We executed DSDCluster with K between 100 and 700, in steps of 100, in Collins, Krogan Core, Krogan Extended, and Gavin. In DIP-yeast we went up to K = 1600. For Biogrid-yeast, HPRD and Biogrid-human we used K = 500, 1000, 1500, 2000, 2500. We obtained the best results with K = 200 in Collins, K = 500 in Krogan Core, K = 700 in Krogan Extended, K = 500 in Gavin, K = 1200 in DIP-yeast, K = 1000 in Biogrid-yeast, K = 2000 in HPRD, and K = 2500 in Biogrid-human. Tables 9 to 14 show our results compared with the state-of-the-art techniques available for protein complex prediction for yeast. Similarly, Table 15 shows the results for human. We evaluated clustering metrics and biological metrics.

We observed that we obtain the best MMR in the Collins, Gavin, DIP-yeast and Biogrid-yeast PPI networks using the three gold standards and their combinations. In the Krogan Core PPI we are second after GMFTP, which is the best for the three gold standards, but we are better in the combined references. In the Krogan Extended PPI we are best using CYC2008, GMFTP is best with SGD and COREPEEL is best with MIPS; in the merged gold standards COREPEEL is the best and we are second. We also observed that, for most human PPIs, COREPEEL is the best and we are second. We also report execution times. All methods were executed locally, except COREPEEL, which is executed through its web site and reports the execution time as part of its results. SPICI is the fastest method.
Table 9

Performance comparison results of clustering and biological metrics in Collins.

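Network: Collins.

| Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s) |
|---|---|---|---|---|---|---|---|---|---|
| CYC2008 | DAPG | 633 | 0.7248 | 0.7234 | 0.7183 | 0.9692 | 0.7692 | 0.9435 | 2.36 |
| | GMFTP | 189 | 0.7631 | 0.7858 | 0.6410 | 0.9542 | 0.7489 | 0.9052 | > 12 hrs. |
| | ClusterONE | 187 | 0.6940 | 0.7677 | 0.5711 | 0.9211 | 0.7124 | 0.8225 | 1.37 |
| | MCL | 195 | 0.6897 | 0.7635 | 0.5729 | 0.9268 | 0.7310 | 0.8823 | 0.74 |
| | CFinder | 113 | 0.6583 | 0.6518 | 0.4361 | 0.8641 | 0.6173 | 0.9027 | 119.54 |
| | DCAFP | 880 | 0.8433 | 0.6784 | 0.5575 | 0.9386 | 0.7212 | 0.9234 | 231.18 |
| | RNSC | 178 | 0.6980 | 0.7756 | 0.5812 | 0.9313 | 0.7397 | 0.8930 | 1.42 |
| | MCODE | 93 | 0.6233 | 0.6035 | 0.3213 | 0.8750 | 0.6345 | 0.9125 | 0.52 |
| | SPICI | 104 | 0.6579 | 0.7145 | 0.4115 | 0.9476 | 0.7546 | 0.9214 | 0.14 |
| | COREPEEL | 458 | 0.6751 | 0.7037 | 0.6718 | 0.9501 | 0.7377 | 0.9334 | 0.23 |
| | DSDCluster | 142 | 0.4626 | 0.6065 | 0.2863 | 0.9179 | 0.7533 | 0.8943 | 41.93 |
| SGD | DAPG | 633 | 0.6037 | 0.5409 | 0.5956 | | | | |
| | GMFTP | 189 | 0.6795 | 0.5988 | 0.5295 | | | | |
| | ClusterONE | 187 | 0.5817 | 0.6017 | 0.4357 | | | | |
| | MCL | 195 | 0.6039 | 0.5885 | 0.4500 | | | | |
| | CFinder | 113 | 0.5126 | 0.5143 | 0.3215 | | | | |
| | DCAFP | 880 | 0.7091 | 0.5103 | 0.4959 | | | | |
| | RNSC | 178 | 0.6207 | 0.5899 | 0.4432 | | | | |
| | MCODE | 93 | 0.5048 | 0.5050 | 0.2430 | | | | |
| | SPICI | 104 | 0.5845 | 0.5456 | 0.3096 | | | | |
| | COREPEEL | 458 | 0.5646 | 0.5251 | 0.5151 | | | | |
| | DSDCluster | 142 | 0.3838 | 0.4595 | 0.2124 | | | | |
| MIPS | DAPG | 633 | 0.5449 | 0.5417 | 0.4956 | | | | |
| | GMFTP | 189 | 0.5356 | 0.5338 | 0.4269 | | | | |
| | ClusterONE | 187 | 0.5517 | 0.5439 | 0.4110 | | | | |
| | MCL | 195 | 0.4742 | 0.5070 | 0.3856 | | | | |
| | CFinder | 113 | 0.5023 | 0.4430 | 0.3042 | | | | |
| | DCAFP | 880 | 0.6930 | 0.5275 | 0.4302 | | | | |
| | RNSC | 178 | 0.5147 | 0.5182 | 0.4070 | | | | |
| | MCODE | 93 | 0.5532 | 0.4804 | 0.2808 | | | | |
| | SPICI | 104 | 0.5500 | 0.5046 | 0.3063 | | | | |
| | COREPEEL | 458 | 0.4739 | 0.5271 | 0.4402 | | | | |
| | DSDCluster | 142 | 0.3838 | 0.4595 | 0.2124 | | | | |
| CYC2008, SGD | DAPG | 633 | 0.7157 | 0.5591 | 0.5837 | | | | |
| | GMFTP | 189 | 0.7202 | 0.5846 | 0.4549 | | | | |
| | ClusterONE | 187 | 0.6325 | 0.5842 | 0.3955 | | | | |
| | MCL | 195 | 0.6424 | 0.5709 | 0.4034 | | | | |
| | CFinder | 113 | 0.5348 | 0.5005 | 0.2914 | | | | |
| | DCAFP | 880 | 0.8193 | 0.5332 | 0.5008 | | | | |
| | RNSC | 178 | 0.6624 | 0.5794 | 0.4044 | | | | |
| | MCODE | 93 | 0.5508 | 0.4745 | 0.2274 | | | | |
| | SPICI | 104 | 0.5772 | 0.5343 | 0.2743 | | | | |
| | COREPEEL | 458 | 0.6667 | 0.5375 | 0.5032 | | | | |
| | DSDCluster | 142 | 0.2834 | 0.4295 | 0.1688 | | | | |
| CYC2008, SGD, MIPS | DAPG | 633 | 0.7101 | 0.5480 | 0.5723 | | | | |
| | GMFTP | 189 | 0.7143 | 0.5770 | 0.4376 | | | | |
| | ClusterONE | 187 | 0.6265 | 0.5765 | 0.3825 | | | | |
| | MCL | 195 | 0.6424 | 0.5616 | 0.3903 | | | | |
| | CFinder | 113 | 0.5201 | 0.4907 | 0.2803 | | | | |
| | DCAFP | 880 | 0.8119 | 0.5253 | 0.4891 | | | | |
| | RNSC | 178 | 0.6581 | 0.5713 | 0.3939 | | | | |
| | MCODE | 93 | 0.5424 | 0.4700 | 0.2185 | | | | |
| | SPICI | 104 | 0.5645 | 0.5279 | 0.2640 | | | | |
| | COREPEEL | 458 | 0.6620 | 0.5269 | 0.4961 | | | | |
| | DSDCluster | 142 | 0.4407 | 0.4628 | 0.2101 | | | | |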
Table 14

Performance comparison results of clustering and biological metrics in Biogrid-yeast.

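Network: Biogrid-yeast.

| Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s) |
|---|---|---|---|---|---|---|---|---|---|
| CYC2008 | DAPG | 4,991 | 0.1740 | 0.5967 | 0.3845 | 0.7143 | 0.5410 | 0.6524 | 144.58 |
| | ClusterONE | 369 | 0.3132 | 0.5426 | 0.1599 | 0.8241 | 0.6370 | 0.4203 | 42.74 |
| | MCL | 136 | 0.0919 | 0.2872 | 0.0303 | 0.5624 | 0.5794 | 0.5156 | 63.23 |
| | DCAFP | 1,545 | 0.4250 | 0.4642 | 0.2846 | 0.6590 | 0.4149 | 0.9043 | 20,063.2 |
| | RNSC | 755 | 0.1264 | 0.5868 | 0.1301 | 0.6680 | 0.5822 | 0.4351 | 128.29 |
| | MCODE | 24 | 0.0077 | 0.1220 | 0.0014 | 0.4582 | 0.3355 | 0.7523 | 5,562.32 |
| | SPICI | 389 | 0.1618 | 0.5154 | 0.0839 | 0.6317 | 0.4797 | 0.5434 | 0.82 |
| | COREPEEL | 5,406 | 0.2048 | 0.5490 | 0.3412 | 0.7356 | 0.5611 | 0.6918 | 23.02 |
| | DSDCluster | 557 | 0.3019 | 0.5576 | 0.2282 | 0.6414 | 0.5340 | 0.6879 | 4.5 hrs. |
| SGD | DAPG | 4,977 | 0.1484 | 0.4386 | 0.3405 | | | | |
| | ClusterONE | 369 | 0.3062 | 0.4341 | 0.1438 | | | | |
| | MCL | 136 | 0.0852 | 0.2313 | 0.0296 | | | | |
| | DCAFP | 1,545 | 0.4048 | 0.3729 | 0.2731 | | | | |
| | RNSC | 755 | 0.1263 | 0.4685 | 0.1174 | | | | |
| | MCODE | 24 | 0.0067 | 0.0885 | 0.0012 | | | | |
| | SPICI | 389 | 0.1469 | 0.4156 | 0.0680 | | | | |
| | COREPEEL | 5,406 | 0.1654 | 0.4116 | 0.3038 | | | | |
| | DSDCluster | 557 | 0.2686 | 0.4144 | 0.1885 | | | | |
| MIPS | DAPG | 4,977 | 0.1038 | 0.3787 | 0.2700 | | | | |
| | ClusterONE | 369 | 0.2094 | 0.3769 | 0.1096 | | | | |
| | MCL | 136 | 0.0559 | 0.1943 | 0.0221 | | | | |
| | DCAFP | 1,545 | 0.3666 | 0.3819 | 0.2667 | | | | |
| | RNSC | 755 | 0.0905 | 0.4016 | 0.1026 | | | | |
| | MCODE | 24 | 0.0094 | 0.1074 | 0.0017 | | | | |
| | SPICI | 389 | 0.1117 | 0.3861 | 0.0684 | | | | |
| | COREPEEL | 5,406 | 0.1437 | 0.3570 | 0.2431 | | | | |
| | DSDCluster | 557 | 0.1951 | 0.3510 | 0.1597 | | | | |
| CYC2008, SGD | DAPG | 4,977 | 0.1834 | 0.4098 | 0.3294 | | | | |
| | ClusterONE | 369 | 0.3412 | 0.4167 | 0.1332 | | | | |
| | MCL | 136 | 0.0797 | 0.2113 | 0.0247 | | | | |
| | DCAFP | 1,545 | 0.4578 | 0.3507 | 0.2552 | | | | |
| | RNSC | 755 | 0.1469 | 0.4610 | 0.1057 | | | | |
| | MCODE | 24 | 0.0050 | 0.0875 | 0.0008 | | | | |
| | SPICI | 389 | 0.1603 | 0.3964 | 0.0614 | | | | |
| | COREPEEL | 5,406 | 0.2164 | 0.3802 | 0.2935 | | | | |
| | DSDCluster | 557 | 0.3177 | 0.4083 | 0.1783 | | | | |
| CYC2008, SGD, MIPS | DAPG | 4,977 | 0.1885 | 0.4032 | 0.3219 | | | | |
| | ClusterONE | 369 | 0.3342 | 0.4065 | 0.1281 | | | | |
| | MCL | 136 | 0.0795 | 0.2055 | 0.0236 | | | | |
| | DCAFP | 1,545 | 0.4569 | 0.3430 | 0.2593 | | | | |
| | RNSC | 755 | 0.1447 | 0.4518 | 0.0999 | | | | |
| | MCODE | 24 | 0.0047 | 0.0857 | 0.0008 | | | | |
| | SPICI | 389 | 0.1585 | 0.3876 | 0.0590 | | | | |
| | COREPEEL | 5,406 | 0.2217 | 0.3751 | 0.2897 | | | | |
| | DSDCluster | 557 | 0.3131 | 0.4003 | 0.1691 | | | | |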
Table 15

Performance comparison results of clustering and biological metrics in HPRD and Biogrid-human.

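| Network | Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|
| HPRD | PCDq | DAPG | 2,777 | 0.3431 | 0.2992 | 0.1681 | 0.9225 | 0.4192 | 0.6564 | 30.78 |
| | | ClusterONE | 2,186 | 0.2923 | 0.5122 | 0.1718 | 0.7735 | 0.4106 | 0.3114 | 4.6 |
| | | MCL | 1,248 | 0.2167 | 0.4717 | 0.1120 | 0.7430 | 0.3831 | 0.4150 | 10.39 |
| | | CFinder | 416 | 0.1637 | 0.2935 | 0.0598 | 0.6283 | 0.3284 | 0.2383 | 12.42 |
| | | DCAFP | 123 | 0.1185 | 0.1654 | 0.0086 | 0.8532 | 0.3440 | 0.8848 | 25,470.12 |
| | | RNSC | 1,081 | 0.2250 | 0.4445 | 0.1122 | 0.8235 | 0.4241 | 0.3862 | 2.32 |
| | | MCODE | 16 | 0.0170 | 0.1003 | 0.0041 | 0.8033 | 0.5806 | 0.6553 | 10.23 |
| | | SPICI | 722 | 0.2410 | 0.4148 | 0.0835 | 0.7856 | 0.3801 | 0.4510 | 0.82 |
| | | COREPEEL | 3,420 | 0.3577 | 0.2943 | 0.1852 | 0.9249 | 0.4074 | 0.6667 | 1.01 |
| | | DSDCluster | 1,247 | 0.2012 | 0.4181 | 0.0994 | 0.7389 | 0.3874 | 0.5405 | 3.8 hrs. |
| | CORUM | DAPG | 2,777 | 0.3685 | 0.2119 | 0.2066 | | | | |
| | | ClusterONE | 2,186 | 0.1348 | 0.3162 | 0.0730 | | | | |
| | | MCL | 1,248 | 0.1048 | 0.3042 | 0.0488 | | | | |
| | | CFinder | 416 | 0.0769 | 0.1982 | 0.0270 | | | | |
| | | DCAFP | 123 | 0.1490 | 0.1460 | 0.0270 | | | | |
| | | RNSC | 1,081 | 0.1234 | 0.2773 | 0.0565 | | | | |
| | | MCODE | 16 | 0.0154 | 0.0786 | 0.0047 | | | | |
| | | SPICI | 722 | 0.1095 | 0.2566 | 0.0357 | | | | |
| | | COREPEEL | 3,420 | 0.4017 | 0.2131 | 0.2360 | | | | |
| | | DSDCluster | 1,247 | 0.1056 | 0.2671 | 0.0510 | | | | |
| | CORUM, PCDq | DAPG | 2,777 | 0.4757 | 0.1987 | 0.1788 | | | | |
| | | ClusterONE | 2,186 | 0.2887 | 0.3485 | 0.1101 | | | | |
| | | MCL | 1,248 | 0.1936 | 0.3233 | 0.0701 | | | | |
| | | CFinder | 416 | 0.1166 | 0.2036 | 0.0368 | | | | |
| | | DCAFP | 123 | 0.0898 | 0.1161 | 0.0155 | | | | |
| | | RNSC | 1,081 | 0.2080 | 0.3010 | 0.0743 | | | | |
| | | MCODE | 16 | 0.0094 | 0.0652 | 0.0027 | | | | |
| | | SPICI | 722 | 0.1946 | 0.2761 | 0.0506 | | | | |
| | | COREPEEL | 3,420 | 0.5168 | 0.1970 | 0.2033 | | | | |
| | | DSDCluster | 1,247 | 0.1884 | 0.2837 | 0.0661 | | | | |
| Biogrid-human | PCDq | DAPG | 7,409 | 0.1599 | 0.3495 | 0.1272 | 0.8213 | 0.4041 | 0.5443 | 620.32 |
| | | ClusterONE | 4,254 | 0.0863 | 0.4802 | 0.0653 | 0.6476 | 0.4008 | 0.2532 | 201.32 |
| | | MCL | 1,433 | 0.0431 | 0.3594 | 0.0190 | 0.6225 | 0.3695 | 0.2392 | 54.21 |
| | | RNSC | 2,194 | 0.0774 | 0.4491 | 0.0502 | 0.8235 | 0.3971 | 0.2206 | 35.23 |
| | | MCODE | 20 | 0.0063 | 0.0883 | 0.0013 | 0.8312 | 0.3695 | 0.5262 | 475.23 |
| | | SPICI | 1,063 | 0.0803 | 0.3784 | 0.0263 | 0.6763 | 0.3729 | 0.3829 | 1.01 |
| | | COREPEEL | 9,772 | 0.1995 | 0.3200 | 0.1550 | 0.8468 | 0.4059 | 0.5782 | 10.83 |
| | | DSDCluster | 1,593 | 0.0610 | 0.3673 | 0.0307 | 0.6344 | 0.3601 | 0.4148 | 5.5 hrs. |
| | CORUM | DAPG | 7,409 | 0.2527 | 0.2917 | 0.2539 | | | | |
| | | ClusterONE | 4,254 | 0.0529 | 0.3625 | 0.0417 | | | | |
| | | MCL | 1,433 | 0.0403 | 0.2610 | 0.0179 | | | | |
| | | RNSC | 2,194 | 0.0637 | 0.3632 | 0.0418 | | | | |
| | | MCODE | 20 | 0.0105 | 0.1046 | 0.0032 | | | | |
| | | SPICI | 1,063 | 0.0643 | 0.3013 | 0.0235 | | | | |
| | | COREPEEL | 9,772 | 0.3477 | 0.2778 | 0.3063 | | | | |
| | | DSDCluster | 1,593 | 0.0824 | 0.3118 | 0.0409 | | | | |
| | CORUM, PCDq | DAPG | 7,409 | 0.3002 | 0.2585 | 0.1847 | | | | |
| | | ClusterONE | 4,254 | 0.1020 | 0.3709 | 0.0485 | | | | |
| | | MCL | 1,433 | 0.0512 | 0.2655 | 0.0165 | | | | |
| | | RNSC | 2,194 | 0.0921 | 0.3596 | 0.0402 | | | | |
| | | MCODE | 20 | 0.0069 | 0.0878 | 0.0018 | | | | |
| | | SPICI | 1,063 | 0.0836 | 0.2899 | 0.0217 | | | | |
| | | COREPEEL | 9,772 | 0.3965 | 0.2414 | 0.2250 | | | | |
| | | DSDCluster | 1,593 | 0.0848 | 0.2904 | 0.0305 | | | | |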

Evaluating overlap on predicted complexes

In this section we evaluate how well protein complexes in gold standards are matched by predicted complexes. We first evaluated and compared the protein complex overlap, as described earlier, using cumulative histograms. We compute the cumulative histogram of all pairs of reference complex and predicted complex (c, pc) obtained when computing the MMR (where OS(c, pc) ≥ 0.2). We also compute the MMR varying the overlap score threshold. Figs 2 and 3 (left column) show the cumulative histogram for overlap between predicted and reference complexes for all PPIs. Figs 2 and 3 (right column) show the MMR for different overlap scores. We observed that DAPG is best in Collins and DIP-yeast, although we did not try GMFTP in DIP-yeast because it was several orders of magnitude slower than DAPG in smaller PPIs (as seen in Tables 9 to 12). The figures also show that DAPG has the best MMR results across the different overlap scores.
Fig 2

Cumulative histogram for predicted complexes matches with reference complexes based on MMR on small PPIs.

Matching predicted complexes to reference complexes cumulative histogram for various yeast PPI networks and references CYC2008. Figures on right column show how MMR varies when changing the overlap score.

Fig 3

Cumulative histogram for predicted complexes matches with reference complexes based on MMR on large PPIs.

Matching predicted complexes to reference complexes cumulative histogram for a large yeast PPI network using references CYC2008, and two Human PPI networks using gold standard CORUM. Figures on right column show how MMR varies when changing the overlap score.

Table 12

Performance comparison results of clustering and biological metrics in Gavin.

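Network: Gavin.

| Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s) |
|---|---|---|---|---|---|---|---|---|---|
| CYC2008 | DAPG | 715 | 0.6164 | 0.7135 | 0.6079 | 0.8750 | 0.6687 | 0.8041 | 1.66 |
| | GMFTP | 242 | 0.6096 | 0.7705 | 0.5861 | 0.8586 | 0.6761 | 0.7561 | > 12 hrs. |
| | ClusterONE | 194 | 0.6854 | 0.7498 | 0.5378 | 0.8934 | 0.6810 | 0.8367 | 1.41 |
| | MCL | 254 | 0.5372 | 0.7435 | 0.4828 | 0.7865 | 0.6342 | 0.7124 | 2.01 |
| | CFinder | 183 | 0.4466 | 0.6210 | 0.3391 | 0.7335 | 0.5370 | 0.6412 | 598.84 |
| | DCAFP | 804 | 0.7118 | 0.6296 | 0.4416 | 0.8855 | 0.6626 | 0.7843 | 133.79 |
| | RNSC | 241 | 0.5556 | 0.7551 | 0.5106 | 0.8188 | 0.6566 | 0.7135 | 0.056 |
| | MCODE | 107 | 0.5281 | 0.6092 | 0.2547 | 0.8081 | 0.5954 | 0.7982 | 11.28 |
| | SPICI | 91 | 0.6574 | 0.5905 | 0.3381 | 0.8965 | 0.7458 | 0.8972 | 0.09 |
| | COREPEEL | 690 | 0.5795 | 0.6998 | 0.5686 | 0.8643 | 0.6883 | 0.7753 | 0.15 |
| | DSDCluster | 265 | 0.5390 | 0.6918 | 0.4662 | 0.8101 | 0.6587 | 0.6603 | 63.70 |
| SGD | DAPG | 715 | 0.5188 | 0.5270 | 0.4956 | | | | |
| | GMFTP | 242 | 0.5393 | 0.5842 | 0.4448 | | | | |
| | ClusterONE | 194 | 0.5855 | 0.5702 | 0.3974 | | | | |
| | MCL | 254 | 0.4641 | 0.5502 | 0.3510 | | | | |
| | CFinder | 183 | 0.3529 | 0.4794 | 0.2526 | | | | |
| | DCAFP | 804 | 0.6393 | 0.4849 | 0.4062 | | | | |
| | RNSC | 241 | 0.4638 | 0.5703 | 0.3731 | | | | |
| | MCODE | 107 | 0.3964 | 0.4763 | 0.1784 | | | | |
| | SPICI | 91 | 0.5481 | 0.4509 | 0.2473 | | | | |
| | COREPEEL | 690 | 0.4692 | 0.5067 | 0.4643 | | | | |
| | DSDCluster | 265 | 0.4543 | 0.5102 | 0.3419 | | | | |
| MIPS | DAPG | 715 | 0.4376 | 0.4827 | 0.4304 | | | | |
| | GMFTP | 242 | 0.4602 | 0.5240 | 0.4206 | | | | |
| | ClusterONE | 194 | 0.4846 | 0.4981 | 0.3728 | | | | |
| | MCL | 254 | 0.3746 | 0.4983 | 0.3266 | | | | |
| | CFinder | 183 | 0.3559 | 0.4382 | 0.2618 | | | | |
| | DCAFP | 804 | 0.5552 | 0.4628 | 0.3732 | | | | |
| | RNSC | 241 | 0.4012 | 0.4990 | 0.3560 | | | | |
| | MCODE | 107 | 0.4038 | 0.4362 | 0.2007 | | | | |
| | SPICI | 91 | 0.4375 | 0.3737 | 0.2182 | | | | |
| | COREPEEL | 690 | 0.4049 | 0.4679 | 0.4262 | | | | |
| | DSDCluster | 265 | 0.3552 | 0.4520 | 0.3092 | | | | |
| CYC2008, SGD | DAPG | 715 | 0.6163 | 0.5137 | 0.4893 | | | | |
| | GMFTP | 242 | 0.6114 | 0.5686 | 0.4197 | | | | |
| | ClusterONE | 194 | 0.6566 | 0.5476 | 0.3706 | | | | |
| | MCL | 254 | 0.5168 | 0.5440 | 0.3296 | | | | |
| | CFinder | 183 | 0.3988 | 0.4768 | 0.2365 | | | | |
| | DCAFP | 804 | 0.7074 | 0.4606 | 0.4016 | | | | |
| | RNSC | 241 | 0.5308 | 0.5542 | 0.3452 | | | | |
| | MCODE | 107 | 0.4471 | 0.4601 | 0.1707 | | | | |
| | SPICI | 91 | 0.5992 | 0.4303 | 0.2357 | | | | |
| | COREPEEL | 690 | 0.5752 | 0.5000 | 0.4607 | | | | |
| | DSDCluster | 265 | 0.5255 | 0.4995 | 0.3246 | | | | |
| CYC2008, SGD, MIPS | DAPG | 715 | 0.6177 | 0.5022 | 0.4840 | | | | |
| | GMFTP | 242 | 0.6078 | 0.5570 | 0.4070 | | | | |
| | ClusterONE | 194 | 0.6465 | 0.5358 | 0.3549 | | | | |
| | MCL | 254 | 0.5103 | 0.5308 | 0.3174 | | | | |
| | CFinder | 183 | 0.3851 | 0.4637 | 0.2234 | | | | |
| | DCAFP | 804 | 0.7024 | 0.4508 | 0.4001 | | | | |
| | RNSC | 241 | 0.5255 | 0.5430 | 0.3314 | | | | |
| | MCODE | 107 | 0.4479 | 0.4488 | 0.1651 | | | | |
| | SPICI | 91 | 0.5868 | 0.4195 | 0.2273 | | | | |
| | COREPEEL | 690 | 0.5749 | 0.4882 | 0.4503 | | | | |
| | DSDCluster | 265 | 0.5243 | 0.4891 | 0.3121 | | | | |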

In addition, we show in Table 16 the number of complexes that are predicted perfectly (OS = 1.0) by DAPG and the state-of-the-art methods. We observed that GMFTP provides the greatest number of perfect matches in all small yeast networks, except in Krogan Extended, where we get one more complex. We are second best, except in Krogan Core (where RNSC gets one more complex) and in Biogrid-yeast (where DSDCluster identifies 5 more complexes than DAPG, COREPEEL and RNSC). Also, in the human PPIs, we are second after COREPEEL.
Table 16

Number of predicted complexes with perfect matching with complexes in references (CYC2008 and CORUM) (OS = 1.0).

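Small networks

| Approach | Collins | Krogan Core | Krogan Extended | Gavin |
|---|---|---|---|---|
| DAPG | 51 | 25 | 23 | 28 |
| GMFTP | 52 | 30 | 22 | 34 |
| ClusterONE | 42 | 23 | 19 | 28 |
| MCL | 40 | 17 | 8 | 19 |
| CFinder | 38 | 16 | 11 | 20 |
| DCAFP | 4 | 4 | 3 | 3 |
| RNSC | 45 | 26 | 15 | 24 |
| MCODE | 24 | 9 | 5 | 10 |
| SPICI | 23 | 12 | 18 | 23 |
| COREPEEL | 39 | 26 | 18 | 23 |
| DSDCluster | 11 | 17 | 8 | 20 |

Larger networks

| Approach | DIP-yeast | Biogrid-yeast | HPRD | Biogrid-human |
|---|---|---|---|---|
| DAPG | 22 | 2 | 39 | 8 |
| ClusterONE | 3 | 1 | 8 | 1 |
| MCL | 6 | 1 | 7 | 2 |
| CFinder | 13 | - | 4 | - |
| DCAFP | 8 | 0 | 2 | - |
| RNSC | 0 | 2 | 8 | 1 |
| MCODE | 3 | 0 | 2 | 1 |
| SPICI | 7 | 1 | 1 | 1 |
| COREPEEL | 11 | 2 | 46 | 11 |
| DSDCluster | 10 | 7 | 10 | 3 |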
We also compared our algorithm with the most competitive methods, GMFTP and COREPEEL, based on some patterns we detected in the PPIs. We considered the following four yeast complexes, described in the gold standard CYC2008:

- HIR complex: HIR1, HIR2, HIR3, HPC2
- Phosphatidylinositol (PtdIns) 3-kinase complex (functions in CPY sorting): VPS15, VPS30, VPS34, VPS38
- AP-3 Adaptor complex: APL5, APL6, APM3, APS3
- EKC/KEOPS complex: CGI121, BUD32, GON7, KAE1

Fig 4 shows the results, where we include the graph pattern in which the complex is present in each PPI. We mark each complex with a ✔ mark if the method is able to detect the protein complex with OS ≥ 0.8 and with a ✘ mark otherwise.
Fig 4

Comparison detection results for a small dense subgraph pattern.

Besides, we found two more complexes in DIP-yeast that follow the same pattern, i.e., a clique of four proteins missing an edge in the PPI. In both cases DAPG detects them, but COREPEEL does not. These complexes are:

- alpha, alpha-trehalose-phosphate synthase complex: TPS1, TPS3, TPS2, TSL1
- STE5-MAPK complex: FUS3, STE5, STE7, STE11

Finally, we performed a comparison based on the ability of each method to detect protein complexes with proteins that participate in more than one complex. We considered complexes in the CYC2008 gold standard. Table 17 shows how well each method detects these protein complexes.
Table 17

Performance comparison results based on Overlap Score (OS) in detecting overlapping complexes in Collins with gold standard CYC2008.

Collins
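| Protein | Complex | DAPG OS | GMFTP OS | COREPEEL OS |
|---|---|---|---|---|
| TAF14 | Ino80p | 1.000 | 0.758 | 1.000 |
| | TFIIF | 0.231 | 0.750 | - |
| | NuA3 | - | - | - |
| | SWI/SNF | - | - | - |
| | TFIID | - | - | - |
| SWD2 | Compass | 1.000 | 1.000 | 0.875 |
| | mRNA cleavage and polyadenylation | 0.933 | 0.871 | 0.871 |
| ARP4, ACT1 | NuA4 | 0.923 | 0.923 | 0.923 |
| | Swr1p | 0.852 | - | 0.769 |
| | Ino80p | - | - | - |
| NGG1 | SAGA | 0.789 | 0.895 | 0.895 |
| | SLIK | 0.663 | - | 0.420 |
| | Ada2p | 0.267 | - | - |
| TAF5, TAF6, TAF9, TAF10 | SAGA | 0.789 | 0.895 | 0.895 |
| | SLIK | 0.663 | - | 0.420 |
| | TFIID | 0.667 | 0.733 | 0.667 |
| ARP7, ARP9 | RSC | 1.000 | 1.000 | 1.000 |
| | SWI/SNF | 0.833 | 0.833 | 0.750 |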

False positive analysis

Predicting protein complexes is challenging because PPI networks are noisy and incomplete, and the references are also incomplete and not systematically updated. All prediction techniques report false positives (i.e., predicted complexes that are not in the references), although some of these may be real complexes that are missing from the references or have not been discovered yet. In this work, we perform an automatic false positive evaluation of the predicted complexes for yeast and human that are absent from the available references. Our goal is to see whether the reported false positives contain interesting gene sets. To do so, we look them up in curated biological databases such as PDBe (Protein Data Bank in Europe, http://www.ebi.ac.uk/pdbe), which contains information about protein complexes that have been purified and structurally characterized. Most of the protein complexes in PDBe are small and are absent from gold standards such as CYC2008 and CORUM, mainly because these gold standards have not been updated recently. In addition, PDBe does not directly provide a repository of all the protein complexes it contains. Therefore, we propose an automated procedure that queries the database to find out whether sets of genes are registered as purified complexes in PDBe. Our analysis does not include protein complexes already found in the gold standards (i.e., CYC2008, SGD, and MIPS for yeast, and CORUM and PCDq for human). We also include information on protein complexes that have been topologically characterized, a study done by Ahnert et al. [30] and available in the periodic table of protein complexes (http://www.periodicproteincomplexes.org). However, this periodic table is not up to date.

In order to automate the procedure we use the following PDBe-related databases:

Uniprot (http://www.uniprot.org): to obtain the protein ids related to pdb ids.
EMBL-EBI Sifts (https://www.ebi.ac.uk/pdbe/docs/sifts/quick.html): to get chain information of proteins.
PDBe REST API (http://www.ebi.ac.uk/pdbe/pdbe-rest-api): to query the entry summary information of a specific PDB id (structure, name, title, release dates).
Protein Complex Periodic Table (http://www.periodicproteincomplexes.org): to query and visualize topology information of heteromeric complexes.

The automatic false positive analysis can be summarized in the following steps:

1.  Obtain the yeast and human data, including PDB ids, from the Uniprot database, together with the Sifts database, which contains the protein domains or chains associated with proteins.
2.  For each false positive complex, find the pdb ids of each protein with the corresponding chains. We define a potential protein complex if the complex contains at least two proteins that share the same pdb id.
3.  Discard a potential protein complex if it is part of a protein complex in a gold standard.
4.  Look up the pdb ids of the potential protein complexes using the PDBe REST API and check, based on the entry summary information, whether each one is a heteromeric complex.
5.  Look up the potential protein complex in the Complex Periodic Table and obtain its number of subunits and repeats as well as its topology. It is important to note that the number of subunits and repeats might differ from the information in PDBe. This variation might be because the periodic table is not up to date.

Tables 18 and 19 display a subset of candidate protein complexes in PDBe for yeast and human that are not in any gold standard and are present in the Periodic Table of Protein Complexes. The complete list of candidate protein complexes we detected for both organisms is available in the software distribution (files with extension .csv).
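The core of the procedure described above (grouping the proteins of a false positive by shared pdb id, discarding groups covered by a gold standard, and checking for a heteromeric assembly) can be sketched in Python as follows. This is only an illustrative sketch, not the code in the software distribution: the function names, the in-memory dictionaries, and the interpretation of "part of a gold standard complex" as a subset test are assumptions made here. The `assemblies` and `form` fields are likewise assumed to follow the layout of a PDBe entry summary record, which should be checked against the PDBe REST API documentation.

```python
from collections import defaultdict


def candidate_complexes(predicted, protein_to_pdb, gold_standard):
    """Return (pdb_id, proteins) pairs for potential complexes.

    predicted      -- iterable of false positive complexes (lists of gene ids)
    protein_to_pdb -- dict: gene id -> set of pdb ids (e.g. built from Uniprot/Sifts)
    gold_standard  -- iterable of reference complexes (lists of gene ids)
    All three inputs are hypothetical in-memory structures used for illustration.
    """
    gold_sets = [frozenset(c) for c in gold_standard]
    candidates = []
    for complex_proteins in predicted:
        # Group the proteins of this prediction by the PDB entries they occur in.
        by_pdb = defaultdict(set)
        for protein in complex_proteins:
            for pdb_id in protein_to_pdb.get(protein, ()):
                by_pdb[pdb_id].add(protein)
        for pdb_id, members in sorted(by_pdb.items()):
            # A potential complex needs at least two proteins sharing a pdb id.
            if len(members) < 2:
                continue
            # Assumption: "part of a gold standard complex" means subset.
            if any(members <= gold for gold in gold_sets):
                continue
            candidates.append((pdb_id, sorted(members)))
    return candidates


def is_heteromeric(entry_summary):
    """Check an entry summary record for a heteromeric assembly.

    Assumes the record holds a list under "assemblies" whose items carry
    a "form" field equal to "hetero" or "homo".
    """
    return any(a.get("form") == "hetero"
               for a in entry_summary.get("assemblies", []))


if __name__ == "__main__":
    # HSP82/SBA1 sharing 2cg9 is taken from Table 18; the rest is toy data.
    predicted = [["HSP82", "SBA1", "X1"], ["A", "B"]]
    protein_to_pdb = {"HSP82": {"2cg9"}, "SBA1": {"2cg9"},
                      "A": {"toy1"}, "B": {"toy1"}}
    gold = [["A", "B", "C"]]
    print(candidate_complexes(predicted, protein_to_pdb, gold))
```

In this toy run, the pair sharing pdb id 2cg9 is kept, while the pair A, B is discarded because it is contained in a gold standard complex. A set-based subset test keeps the filter simple; a stricter or looser matching rule could be substituted without changing the rest of the sketch.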
Table 18

Predicted complexes in Yeast not present in CYC2008, SGD, and MIPS references.

The Gene ids column lists the genes we found in a complex, followed in parentheses by the number of genes found over the number of gene ids in the complex.

Pdb id | Form name | Gene ids | PDBe Title | url | Periodic Table
--- | --- | --- | --- | --- | ---
2cg9 | hetero tetramer | HSP82 SBA1 (2/2) | CRYSTAL STRUCTURE OF AN HSP90-SBA1 CLOSED CHAPERONE COMPLEX (release date: 20060412) | http://www.ebi.ac.uk/pdbe/entry/pdb/2cg9 | 2 subunits, 2 repeats
3rui | hetero tetramer | ATG7 ATG8 (2/2) | Crystal structure of Atg7C-Atg8 complex (release date: 20111123) | http://www.ebi.ac.uk/pdbe/entry/pdb/3rui | 2 subunits, 2 repeats
2z5c | hetero trimer | IRC25 POC4 (2/3) | Crystal Structure of a Novel Chaperone Complex for Yeast 20S Proteasome Assembly (release date: 20080122) | http://www.ebi.ac.uk/pdbe/entry/pdb/2z5c | 3 subunits, 1 repeat
3m1i | hetero trimer | CRM1 GSP1 YRB1 (3/3) | Crystal structure of yeast CRM1 (Xpo1p) in complex with yeast RanBP1 (Yrb1p) and yeast RanGTP (Gsp1pGTP) (release date: 20100602) | http://www.ebi.ac.uk/pdbe/entry/pdb/3m1i | 3 subunits, 1 repeat
2r25 | hetero dimer | SLN1 YPD1 (2/2) | Complex of YPD1 and SLN1-R1 with bound Mg2+ and BeF3- (release date: 20080115) | http://www.ebi.ac.uk/pdbe/entry/pdb/2r25 | 2 subunits, 1 repeat
2v6x | hetero dimer | DID4 VPS4 (2/2) | STRACTURAL INSIGHT INTO THE INTERACTION BETWEEN ESCRT-III AND VPS4 (release date: 20071016) | http://www.ebi.ac.uk/pdbe/entry/pdb/2v6x | 2 subunits, 1 repeat
2z5b | hetero dimer | IRC25 POC4 (2/2) | Crystal Structure of a Novel Chaperone Complex for Yeast 20S Proteasome Assembly (release date: 20080122) | http://www.ebi.ac.uk/pdbe/entry/pdb/2z5b | 2 subunits, 1 repeat
3cmm | hetero dimer | UBA1 UBI4 (2/2) | Crystal Structure of the Uba1-Ubiquitin Complex (release date: 20080805) | http://www.ebi.ac.uk/pdbe/entry/pdb/3cmm | 2 subunits, 1 repeat
3qml | hetero dimer | KAR2 SIL1 (2/2) | The structural analysis of Sil1-Bip complex reveals the mechanism for Sil1 to function as a novel nucleotide exchange factor (release date: 20110629) | http://www.ebi.ac.uk/pdbe/entry/pdb/3qml | 2 subunits, 1 repeat
Table 19

Predicted complexes in Human not present in CORUM and PCDq references.

Pdb id | Form name | Gene ids | PDBe Title | url | Periodic Table
--- | --- | --- | --- | --- | ---
4aj5 | hetero 30-mer | SKA1 SKA2 SKA3 (3/3) | Crystal structure of the Ska core complex (release date: 20120523) | http://www.ebi.ac.uk/pdbe/entry/pdb/4aj5 | 3 subunits, 10 repeats
1zgl | hetero 20-mer | HLA-DRA HLA-DRB5 (2/5) | Crystal structure of 3A6 TCR bound to MBP/HLA-DR2a (release date: 20051018) | http://www.ebi.ac.uk/pdbe/entry/pdb/1zgl | 4 subunits, 4 repeats
2io3 | hetero 12-mer | SENP2 SUMO2 (2/3) | Crystal structure of human Senp2 in complex with RanGAP1-SUMO-2 (release date: 20061114) | http://www.ebi.ac.uk/pdbe/entry/pdb/2io3 | 3 subunits, 4 repeats
1d0g | hetero hexamer | TNFRSF10B TNFSF10 (2/2) | CRYSTAL STRUCTURE OF DEATH RECEPTOR 5 (DR5) BOUND TO APO2L/TRAIL (release date: 19991022) | http://www.ebi.ac.uk/pdbe/entry/pdb/1d0g | 2 subunits, 3 repeats
3l4g | hetero tetramer | FARSA FARSB (2/2) | Crystal structure of Homo Sapiens cytoplasmic Phenylalanyl-tRNA synthetase (release date: 20100309) | http://www.ebi.ac.uk/pdbe/entry/pdb/3l4g | 2 subunits, 2 repeats
1hcf | hetero tetramer | NTF4 NTRK2 (2/2) | CRYSTAL STRUCTURE OF TRKB-D5 BOUND TO NEUROTROPHIN-4/5 (release date: 20011206) | http://www.ebi.ac.uk/pdbe/entry/pdb/1hcf | 2 subunits, 2 repeats
4dxr | hetero hexamer | SUN2 SYNE1 (2/2) | Human SUN2-KASH1 complex (release date: 20120606) | http://www.ebi.ac.uk/pdbe/entry/pdb/4dxr | 1 subunit, 3 repeats
3oj4 | hetero trimer | TNFAIP3 UBC UBE2D1 (3/3) | Crystal structure of the A20 ZnF4 (release date: 20101208) | http://www.ebi.ac.uk/pdbe/entry/pdb/3oj4 | 3 subunits, 1 repeat
1kmc | hetero tetramer | CASP7 XIAP (2/2) | Crystal Structure of the Caspase-7 / XIAP-BIR2 Complex (release date: 20020116) | http://www.ebi.ac.uk/pdbe/entry/pdb/1kmc | 1 subunit, 2 repeats
2ibi | hetero dimer | UBC USP2 (2/2) | Covalent Ubiquitin-USP2 Complex (release date: 20061024) | http://www.ebi.ac.uk/pdbe/entry/pdb/2ibi | 2 subunits, 1 repeat
Discussion and conclusions

We have introduced a novel scheme for detecting protein complexes. Our approach is based on modeling PPI networks as directed acyclic graphs, which allowed us to design an efficient mining heuristic for detecting overlapping dense subgraphs in weighted and unweighted PPI networks. We define protein complexes based on dense subgraphs that usually overlap. An important advantage of our approach is that it can easily be extended with new traveler and objective functions. New traveler functions might improve the mining process for discovering dense subgraphs, and new objective functions might include biological knowledge to discover subgraphs with biological significance. Therefore, further extensions to our framework are based on adding biological information that might improve the discovery of protein complexes or other protein relationships of biological relevance.

We compare our results with state-of-the-art techniques and show that we provide good performance in terms of clustering using different gold standards and biological metrics, as well as good execution times. We show that our method is able to achieve very good results in terms of perfectly matching (OS = 1.0) protein complexes in the gold standards.

We also provide a post-processing analysis to study false positive complexes, that is, predicted complexes whose proteins are in the PPI networks but that are absent from the gold standards. In order to study false positives, we consider the information available on protein complexes that have been purified and structurally characterized in PDBe. We used this information together with a recent approach that proposes a periodic table for protein complexes, which studies different topologies according to the subunits that compose protein complexes. In this study we discovered that more than 50 yeast and more than 300 human false positive complexes, not present in the gold standards, have actually already been characterized, and their information is available in PDBe. Many of these complexes have also been found to have an associated type in the periodic table of protein complexes [30]. We propose that these "new" real complexes, discovered by our approach and already present in such structural databases, be considered as new candidates for inclusion in the gold standards of protein complexes. Considering these results, we present our list of predicted false positive protein complexes to the scientific community, conjecturing that at least part of them could in fact be real complexes waiting to be studied and characterized.

Table A1: Mining algorithm. Discovering DSGs in DAPG.
Table A2: Detection of a DSG starting at a given node in DAPG.
Table A3: Algorithms for redundancy-filtering.
Table A4-A17: DAPG results with different parameters and input PPI networks.
Table A18-A29: Other method results with different parameters and input PPI networks.
(PDF)
Table 10

Performance comparison results of clustering and biological metrics in Krogan Core.

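PPI network: Krogan Core

Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CYC2008 | DAPG | 649 | 0.6782 | 0.6242 | 0.5059 | 0.8976 | 0.7099 | 0.8533 | 2.19
CYC2008 | GMFTP | 287 | 0.6079 | 0.7731 | 0.5370 | 0.8524 | 0.6741 | 0.7026 | > 12 hrs
CYC2008 | ClusterONE | 411 | 0.5844 | 0.7409 | 0.5065 | 0.7937 | 0.6542 | 0.6830 | 1.65
CYC2008 | MCL | 377 | 0.4226 | 0.7362 | 0.4119 | 0.6794 | 0.5975 | 0.6072 | 8.62
CYC2008 | CFinder | 113 | 0.4719 | 0.5477 | 0.2783 | 0.7203 | 0.5329 | 0.7653 | 0.33
CYC2008 | DCAFP | 384 | 0.8494 | 0.5814 | 0.3278 | 0.8587 | 0.7269 | 0.9043 | 640.06
CYC2008 | RNSC | 293 | 0.4732 | 0.6951 | 0.4378 | 0.7970 | 0.6818 | 0.6110 | 0.68
CYC2008 | MCODE | 83 | 0.4615 | 0.5282 | 0.1829 | 0.7807 | 0.6345 | 0.7271 | 5.68
CYC2008 | SPICI | 133 | 0.5714 | 0.6581 | 0.3293 | 0.9076 | 0.7132 | 0.8125 | 0.18
CYC2008 | COREPEEL | 723 | 0.6042 | 0.6032 | 0.4869 | 0.8733 | 0.7086 | 0.7869 | 0.24
CYC2008 | DSDCluster | 368 | 0.4208 | 0.7044 | 0.4064 | 0.6579 | 0.5667 | 0.5667 | 121.96
SGD | DAPG | 649 | 0.6266 | 0.4519 | 0.4153 | | | |
SGD | GMFTP | 287 | 0.5536 | 0.5550 | 0.4270 | | | |
SGD | ClusterONE | 411 | 0.5261 | 0.5520 | 0.3833 | | | |
SGD | MCL | 377 | 0.3680 | 0.5336 | 0.2970 | | | |
SGD | CFinder | 113 | 0.4014 | 0.3994 | 0.2051 | | | |
SGD | DCAFP | 384 | 0.7637 | 0.4234 | 0.2842 | | | |
SGD | RNSC | 293 | 0.4340 | 0.5056 | 0.3220 | | | |
SGD | MCODE | 83 | 0.3745 | 0.3950 | 0.1324 | | | |
SGD | SPICI | 133 | 0.5300 | 0.4881 | 0.2604 | | | |
SGD | COREPEEL | 723 | 0.5497 | 0.4406 | 0.3967 | | | |
SGD | DSDCluster | 368 | 0.3804 | 0.5041 | 0.3137 | | | |
MIPS | DAPG | 649 | 0.4612 | 0.3793 | 0.3085 | | | |
MIPS | GMFTP | 287 | 0.3990 | 0.4597 | 0.3479 | | | |
MIPS | ClusterONE | 411 | 0.3443 | 0.4363 | 0.3356 | | | |
MIPS | MCL | 377 | 0.2729 | 0.4362 | 0.2681 | | | |
MIPS | CFinder | 113 | 0.3030 | 0.3417 | 0.1638 | | | |
MIPS | DCAFP | 384 | 0.6396 | 0.3835 | 0.2731 | | | |
MIPS | RNSC | 293 | 0.2843 | 0.4142 | 0.2560 | | | |
MIPS | MCODE | 83 | 0.3415 | 0.3625 | 0.1257 | | | |
MIPS | SPICI | 133 | 0.3443 | 0.4000 | 0.1952 | | | |
MIPS | COREPEEL | 723 | 0.4118 | 0.3699 | 0.2829 | | | |
MIPS | DSDCluster | 368 | 0.2672 | 0.4123 | 0.2720 | | | |
CYC2008, SGD | DAPG | 649 | 0.6760 | 0.4206 | 0.4115 | | | |
CYC2008, SGD | GMFTP | 287 | 0.5921 | 0.5327 | 0.3682 | | | |
CYC2008, SGD | ClusterONE | 411 | 0.5868 | 0.5284 | 0.3526 | | | |
CYC2008, SGD | MCL | 377 | 0.4007 | 0.5140 | 0.2677 | | | |
CYC2008, SGD | CFinder | 113 | 0.3939 | 0.3810 | 0.1849 | | | |
CYC2008, SGD | DCAFP | 384 | 0.7929 | 0.4048 | 0.2797 | | | |
CYC2008, SGD | RNSC | 293 | 0.4555 | 0.4863 | 0.2878 | | | |
CYC2008, SGD | MCODE | 83 | 0.3436 | 0.3774 | 0.1149 | | | |
CYC2008, SGD | SPICI | 133 | 0.5128 | 0.4592 | 0.2164 | | | |
CYC2008, SGD | COREPEEL | 723 | 0.6053 | 0.4073 | 0.3943 | | | |
CYC2008, SGD | DSDCluster | 368 | 0.4135 | 0.4899 | 0.2805 | | | |
CYC2008, SGD, MIPS | DAPG | 649 | 0.6734 | 0.4116 | 0.4022 | | | |
CYC2008, SGD, MIPS | GMFTP | 287 | 0.5914 | 0.5251 | 0.3578 | | | |
CYC2008, SGD, MIPS | ClusterONE | 411 | 0.5918 | 0.5196 | 0.3487 | | | |
CYC2008, SGD, MIPS | MCL | 377 | 0.4007 | 0.5041 | 0.2617 | | | |
CYC2008, SGD, MIPS | CFinder | 113 | 0.3871 | 0.3737 | 0.1788 | | | |
CYC2008, SGD, MIPS | DCAFP | 384 | 0.7756 | 0.3951 | 0.2752 | | | |
CYC2008, SGD, MIPS | RNSC | 293 | 0.4590 | 0.4772 | 0.2836 | | | |
CYC2008, SGD, MIPS | MCODE | 83 | 0.3467 | 0.3678 | 0.1122 | | | |
CYC2008, SGD, MIPS | SPICI | 133 | 0.5000 | 0.4513 | 0.2094 | | | |
CYC2008, SGD, MIPS | COREPEEL | 723 | 0.6046 | 0.3981 | 0.3883 | | | |
CYC2008, SGD, MIPS | DSDCluster | 368 | 0.4885 | 0.4799 | 0.2692 | | | |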
Table 11

Performance comparison results of clustering and biological metrics in Krogan Extended.

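PPI network: Krogan Extended

Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CYC2008 | DAPG | 830 | 0.5411 | 0.6226 | 0.4724 | 0.8268 | 0.6798 | 0.6783 | 8.33
CYC2008 | GMFTP | 364 | 0.4510 | 0.7389 | 0.4509 | 0.7634 | 0.6165 | 0.5792 | > 12 hrs
CYC2008 | ClusterONE | 402 | 0.5751 | 0.7043 | 0.4551 | 0.7960 | 0.6546 | 0.6741 | 2.18
CYC2008 | MCL | 480 | 0.3328 | 0.7154 | 0.3113 | 0.5977 | 0.5231 | 0.4987 | 19.50
CYC2008 | CFinder | 118 | 0.2993 | 0.4126 | 0.1682 | 0.6154 | 0.4466 | 0.6365 | 1.43
CYC2008 | DCAFP | 519 | 0.7302 | 0.5928 | 0.3356 | 0.8924 | 0.7442 | 0.7343 | 750.23
CYC2008 | RNSC | 326 | 0.3589 | 0.6657 | 0.3322 | 0.7233 | 0.6399 | 0.4923 | 0.24
CYC2008 | MCODE | 55 | 0.2807 | 0.4365 | 0.1044 | 0.6687 | 0.5143 | 0.7872 | 13.12
CYC2008 | SPICI | 147 | 0.5364 | 0.6370 | 0.3126 | 0.8700 | 0.6971 | 0.7172 | 0.10
CYC2008 | COREPEEL | 1,223 | 0.4842 | 0.6236 | 0.4564 | 0.8302 | 0.6886 | 0.6884 | 0.26
CYC2008 | DSDCluster | 530 | 0.3105 | 0.6619 | 0.3250 | 0.5856 | 0.5212 | 0.4301 | 480.08
SGD | DAPG | 830 | 0.4836 | 0.4400 | 0.3662 | | | |
SGD | GMFTP | 364 | 0.4400 | 0.5221 | 0.3532 | | | |
SGD | ClusterONE | 402 | 0.4992 | 0.5187 | 0.3259 | | | |
SGD | MCL | 480 | 0.2708 | 0.5040 | 0.2121 | | | |
SGD | CFinder | 118 | 0.2531 | 0.3155 | 0.1312 | | | |
SGD | DCAFP | 519 | 0.6551 | 0.4244 | 0.2714 | | | |
SGD | RNSC | 326 | 0.3230 | 0.4754 | 0.2455 | | | |
SGD | MCODE | 55 | 0.2162 | 0.3157 | 0.0761 | | | |
SGD | SPICI | 147 | 0.4969 | 0.4655 | 0.2424 | | | |
SGD | COREPEEL | 1,223 | 0.4350 | 0.4486 | 0.3762 | | | |
SGD | DSDCluster | 530 | 0.2639 | 0.4715 | 0.2408 | | | |
MIPS | DAPG | 830 | 0.3724 | 0.3679 | 0.2747 | | | |
MIPS | GMFTP | 364 | 0.3056 | 0.4430 | 0.2980 | | | |
MIPS | ClusterONE | 402 | 0.3417 | 0.4184 | 0.2904 | | | |
MIPS | MCL | 480 | 0.2065 | 0.4075 | 0.1928 | | | |
MIPS | CFinder | 118 | 0.2022 | 0.2491 | 0.1059 | | | |
MIPS | DCAFP | 519 | 0.5392 | 0.3795 | 0.2451 | | | |
MIPS | RNSC | 326 | 0.2495 | 0.3927 | 0.2165 | | | |
MIPS | MCODE | 55 | 0.2079 | 0.2938 | 0.0608 | | | |
MIPS | SPICI | 147 | 0.3286 | 0.3804 | 0.1847 | | | |
MIPS | COREPEEL | 1,223 | 0.3325 | 0.3787 | 0.2806 | | | |
MIPS | DSDCluster | 530 | 0.1898 | 0.3749 | 0.2061 | | | |
CYC2008, SGD | DAPG | 830 | 0.5344 | 0.4076 | 0.3603 | | | |
CYC2008, SGD | GMFTP | 364 | 0.4582 | 0.5000 | 0.2974 | | | |
CYC2008, SGD | ClusterONE | 402 | 0.5606 | 0.4954 | 0.3013 | | | |
CYC2008, SGD | MCL | 480 | 0.3145 | 0.4906 | 0.1970 | | | |
CYC2008, SGD | CFinder | 118 | 0.2398 | 0.2964 | 0.1127 | | | |
CYC2008, SGD | DCAFP | 519 | 0.7045 | 0.4074 | 0.2699 | | | |
CYC2008, SGD | RNSC | 326 | 0.3517 | 0.4610 | 0.2186 | | | |
CYC2008, SGD | MCODE | 55 | 0.2105 | 0.3036 | 0.0653 | | | |
CYC2008, SGD | SPICI | 147 | 0.4833 | 0.4416 | 0.2054 | | | |
CYC2008, SGD | COREPEEL | 1,223 | 0.4937 | 0.4151 | 0.3661 | | | |
CYC2008, SGD | DSDCluster | 530 | 0.3009 | 0.4538 | 0.2177 | | | |
CYC2008, SGD, MIPS | DAPG | 830 | 0.5362 | 0.3996 | 0.3563 | | | |
CYC2008, SGD, MIPS | GMFTP | 364 | 0.4577 | 0.4897 | 0.2905 | | | |
CYC2008, SGD, MIPS | ClusterONE | 402 | 0.5714 | 0.4859 | 0.2985 | | | |
CYC2008, SGD, MIPS | MCL | 480 | 0.3169 | 0.4788 | 0.1930 | | | |
CYC2008, SGD, MIPS | CFinder | 118 | 0.2376 | 0.2894 | 0.1096 | | | |
CYC2008, SGD, MIPS | DCAFP | 519 | 0.7022 | 0.3968 | 0.2720 | | | |
CYC2008, SGD, MIPS | RNSC | 326 | 0.3531 | 0.4508 | 0.2122 | | | |
CYC2008, SGD, MIPS | MCODE | 55 | 0.2018 | 0.2965 | 0.0628 | | | |
CYC2008, SGD, MIPS | SPICI | 147 | 0.4796 | 0.4316 | 0.1989 | | | |
CYC2008, SGD, MIPS | COREPEEL | 1,223 | 0.4943 | 0.4041 | 0.3612 | | | |
CYC2008, SGD, MIPS | DSDCluster | 530 | 0.3018 | 0.4939 | 0.2093 | | | |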
Table 13

Performance comparison results of clustering and biological metrics in DIP-yeast.

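PPI network: DIP-yeast

Reference | Approach | #C | FM | Acc | MMR | GoSim | Coloc. | SC | Time(s)
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
CYC2008 | DAPG | 1,925 | 0.3830 | 0.5486 | 0.4447 | 0.8133 | 0.6664 | 0.8082 | 6.23
CYC2008 | ClusterONE | 1,042 | 0.2436 | 0.6236 | 0.2794 | 0.6353 | 0.5682 | 0.4432 | 1.44
CYC2008 | MCL | 598 | 0.2685 | 0.6259 | 0.2389 | 0.5986 | 0.5355 | 0.4523 | 2.31
CYC2008 | CFinder | 198 | 0.2721 | 0.4272 | 0.1598 | 0.5843 | 0.4173 | 0.4371 | 3.02
CYC2008 | DCAFP | 492 | 0.7212 | 0.5631 | 0.2972 | 0.8897 | 0.7187 | 0.8289 | 3,848.32
CYC2008 | RNSC | 517 | 0.0108 | 0.2966 | 0.0063 | 0.8001 | 0.6218 | 0.1043 | 0.53
CYC2008 | MCODE | 78 | 0.2007 | 0.3734 | 0.0663 | 0.6784 | 0.4546 | 0.8023 | 33.42
CYC2008 | SPICI | 517 | 0.3007 | 0.5826 | 0.2394 | 0.6650 | 0.5697 | 0.6342 | 0.12
CYC2008 | COREPEEL | 742 | 0.5160 | 0.5679 | 0.3239 | 0.8287 | 0.6500 | 0.8277 | 0.16
CYC2008 | DSDCluster | 645 | 0.2787 | 0.5688 | 0.2606 | 0.6233 | 0.5442 | 0.4728 | 2,520.67
SGD | DAPG | 1,925 | 0.3473 | 0.4008 | 0.3620 | | | |
SGD | ClusterONE | 1,042 | 0.2236 | 0.4684 | 0.2179 | | | |
SGD | MCL | 598 | 0.2377 | 0.4454 | 0.1818 | | | |
SGD | CFinder | 198 | 0.2133 | 0.3171 | 0.1145 | | | |
SGD | DCAFP | 492 | 0.6089 | 0.4043 | 0.2329 | | | |
SGD | RNSC | 517 | 0.0102 | 0.2116 | 0.0053 | | | |
SGD | MCODE | 78 | 0.1641 | 0.2784 | 0.0530 | | | |
SGD | SPICI | 517 | 0.2884 | 0.4322 | 0.1859 | | | |
SGD | COREPEEL | 742 | 0.4854 | 0.4153 | 0.2761 | | | |
SGD | DSDCluster | 645 | 0.2503 | 0.4079 | 0.2109 | | | |
MIPS | DAPG | 1,925 | 0.2992 | 0.3475 | 0.3607 | | | |
MIPS | ClusterONE | 1,042 | 0.1422 | 0.3697 | 0.1865 | | | |
MIPS | MCL | 598 | 0.1695 | 0.3598 | 0.1713 | | | |
MIPS | CFinder | 198 | 0.1739 | 0.2584 | 0.1069 | | | |
MIPS | DCAFP | 492 | 0.6181 | 0.3727 | 0.2649 | | | |
MIPS | RNSC | 517 | 0.0029 | 0.1717 | 0.0014 | | | |
MIPS | MCODE | 78 | 0.1562 | 0.2572 | 0.0451 | | | |
MIPS | SPICI | 517 | 0.2101 | 0.3561 | 0.1759 | | | |
MIPS | COREPEEL | 742 | 0.3938 | 0.3619 | 0.2428 | | | |
MIPS | DSDCluster | 645 | 0.1776 | 0.3525 | 0.1768 | | | |
CYC2008, SGD | DAPG | 1,925 | 0.4138 | 0.3769 | 0.3654 | | | |
CYC2008, SGD | ClusterONE | 1,042 | 0.2690 | 0.4441 | 0.2076 | | | |
CYC2008, SGD | MCL | 598 | 0.2835 | 0.4358 | 0.1725 | | | |
CYC2008, SGD | CFinder | 198 | 0.2366 | 0.3053 | 0.1045 | | | |
CYC2008, SGD | DCAFP | 492 | 0.6743 | 0.3806 | 0.2282 | | | |
CYC2008, SGD | RNSC | 517 | 0.0092 | 0.1991 | 0.0040 | | | |
CYC2008, SGD | MCODE | 78 | 0.1691 | 0.2663 | 0.0485 | | | |
CYC2008, SGD | SPICI | 517 | 0.3041 | 0.4126 | 0.1651 | | | |
CYC2008, SGD | COREPEEL | 742 | 0.5395 | 0.3871 | 0.2695 | | | |
CYC2008, SGD | DSDCluster | 645 | 0.2866 | 0.3966 | 0.1896 | | | |
CYC2008, SGD, MIPS | DAPG | 1,925 | 0.4213 | 0.3684 | 0.3684 | | | |
CYC2008, SGD, MIPS | ClusterONE | 1,042 | 0.2718 | 0.4368 | 0.2009 | | | |
CYC2008, SGD, MIPS | MCL | 598 | 0.2832 | 0.4269 | 0.1646 | | | |
CYC2008, SGD, MIPS | CFinder | 198 | 0.2389 | 0.3003 | 0.1039 | | | |
CYC2008, SGD, MIPS | DCAFP | 492 | 0.6704 | 0.3723 | 0.2321 | | | |
CYC2008, SGD, MIPS | RNSC | 517 | 0.0089 | 0.1938 | 0.0037 | | | |
CYC2008, SGD, MIPS | MCODE | 78 | 0.1606 | 0.2588 | 0.0462 | | | |
CYC2008, SGD, MIPS | SPICI | 517 | 0.3131 | 0.4042 | 0.1620 | | | |
CYC2008, SGD, MIPS | COREPEEL | 742 | 0.5437 | 0.3788 | 0.2711 | | | |
CYC2008, SGD, MIPS | DSDCluster | 645 | 0.2914 | 0.3894 | 0.1840 | | | |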
  36 in total

1.  Cytoscape: a software environment for integrated models of biomolecular interaction networks.

Authors:  Paul Shannon; Andrew Markiel; Owen Ozier; Nitin S Baliga; Jonathan T Wang; Daniel Ramage; Nada Amin; Benno Schwikowski; Trey Ideker
Journal:  Genome Res       Date:  2003-11       Impact factor: 9.043

2.  Bootstrapping the interactome: unsupervised identification of protein complexes in yeast.

Authors:  Caroline C Friedel; Jan Krumsiek; Ralf Zimmer
Journal:  J Comput Biol       Date:  2009-08       Impact factor: 1.479

Review 3.  Structure, dynamics, assembly, and evolution of protein complexes.

Authors:  Joseph A Marsh; Sarah A Teichmann
Journal:  Annu Rev Biochem       Date:  2014-12-08       Impact factor: 23.643

4.  Detecting overlapping protein complexes in protein-protein interaction networks.

Authors:  Tamás Nepusz; Haiyuan Yu; Alberto Paccanaro
Journal:  Nat Methods       Date:  2012-03-18       Impact factor: 28.547

5.  Principles of assembly reveal a periodic table of protein complexes.

Authors:  Sebastian E Ahnert; Joseph A Marsh; Helena Hernández; Carol V Robinson; Sarah A Teichmann
Journal:  Science       Date:  2015-12-11       Impact factor: 47.728

6.  MIPS: analysis and annotation of proteins from whole genomes.

Authors:  H W Mewes; C Amid; R Arnold; D Frishman; U Güldener; G Mannhaupt; M Münsterkötter; P Pagel; N Strack; V Stümpflen; J Warfsmann; A Ruepp
Journal:  Nucleic Acids Res       Date:  2004-01-01       Impact factor: 16.971

7.  SPICi: a fast clustering algorithm for large biological networks.

Authors:  Peng Jiang; Mona Singh
Journal:  Bioinformatics       Date:  2010-02-24       Impact factor: 6.937

8.  Global landscape of protein complexes in the yeast Saccharomyces cerevisiae.

Authors:  Nevan J Krogan; Gerard Cagney; Haiyuan Yu; Gouqing Zhong; Xinghua Guo; Alexandr Ignatchenko; Joyce Li; Shuye Pu; Nira Datta; Aaron P Tikuisis; Thanuja Punna; José M Peregrín-Alvarez; Michael Shales; Xin Zhang; Michael Davey; Mark D Robinson; Alberto Paccanaro; James E Bray; Anthony Sheung; Bryan Beattie; Dawn P Richards; Veronica Canadien; Atanas Lalev; Frank Mena; Peter Wong; Andrei Starostine; Myra M Canete; James Vlasblom; Samuel Wu; Chris Orsi; Sean R Collins; Shamanta Chandran; Robin Haw; Jennifer J Rilstone; Kiran Gandi; Natalie J Thompson; Gabe Musso; Peter St Onge; Shaun Ghanny; Mandy H Y Lam; Gareth Butland; Amin M Altaf-Ul; Shigehiko Kanaya; Ali Shilatifard; Erin O'Shea; Jonathan S Weissman; C James Ingles; Timothy R Hughes; John Parkinson; Mark Gerstein; Shoshana J Wodak; Andrew Emili; Jack F Greenblatt
Journal:  Nature       Date:  2006-03-22       Impact factor: 49.962

9.  PCDq: human protein complex database with quality index which summarizes different levels of evidences of protein complexes predicted from h-invitational protein-protein interactions integrative dataset.

Authors:  Shingo Kikugawa; Kensaku Nishikata; Katsuhiko Murakami; Yoshiharu Sato; Mami Suzuki; Md Altaf-Ul-Amin; Shigehiko Kanaya; Tadashi Imanishi
Journal:  BMC Syst Biol       Date:  2012-12-12

10.  Going the distance for protein function prediction: a new distance metric for protein interaction networks.

Authors:  Mengfei Cao; Hao Zhang; Jisoo Park; Noah M Daniels; Mark E Crovella; Lenore J Cowen; Benjamin Hescott
Journal:  PLoS One       Date:  2013-10-23       Impact factor: 3.240

