
An Ensemble Learning Framework for Detecting Protein Complexes From PPI Networks.

Rongquan Wang, Huimin Ma, Caixia Wang

Abstract

Detecting protein complexes is one of the keys to understanding the principles of cellular organization and processes. With advances in high-throughput experiments and computing science, it has become possible to detect protein complexes by computational methods. However, most computational methods are based on either unsupervised or supervised learning. Unsupervised learning-based methods do not need training datasets, but they can detect only one or a few types of topological protein complexes. Supervised learning-based methods can detect protein complexes with different topological structures; however, they are usually based on a single type of training model, and the generalization of a single model is poor. Therefore, we propose an Ensemble Learning Framework for Detecting Protein Complexes (ELF-DPC) within protein-protein interaction (PPI) networks to address these challenges. ELF-DPC first constructs a weighted PPI network by combining topological and biological information. Second, it mines protein complex cores using a protein complex core mining strategy we designed. Third, it obtains an ensemble learning model by integrating structural modularity and a trained voting regressor model. Finally, it extends the protein complex cores into protein complexes by a graph heuristic search strategy. The experimental results demonstrate that ELF-DPC performs better than twelve state-of-the-art approaches. Moreover, functional enrichment analysis illustrates that ELF-DPC can detect biologically meaningful protein complexes. The code and datasets are freely available at https://github.com/RongquanWang/ELF-DPC.
Copyright © 2022 Wang, Ma and Wang.


Keywords:  biological information; ensemble learning; graph clustering algorithms; network embedding; protein complexes; protein-protein interaction networks

Year:  2022        PMID: 35281831      PMCID: PMC8908451          DOI: 10.3389/fgene.2022.839949

Source DB:  PubMed          Journal:  Front Genet        ISSN: 1664-8021            Impact factor:   4.599


1 Introduction

Most complex systems in the real world, such as biological systems and human society, can be represented as complex networks. Social networks, biological networks, brain networks, citation networks, and protein-protein interaction (PPI) networks are all examples of complex networks (Pourkazemi and Keyvanpour, 2017). Community detection in complex networks is essential in many fields; it aims to identify clusters with high internal connectivity that are well separated from the rest of the network. Over the past several years, the study of community identification in complex networks has grown popular. Community detection is a fundamental problem in network analysis that tries to mine the hidden structure of a given complex network (Fortunato, 2010; Abduljabbar et al., 2020). In bioinformatics, a crucial topic is mining protein complexes in PPI networks. Proteins usually interact with each other, forming protein complexes to accomplish their biological functions (Gavin et al., 2002; Spirin and Mirny, 2003). A community structure in a PPI network may correspond to a natural protein complex, and the proteins in a protein complex should be highly interconnected (Girvan and Newman, 2002; Chen et al., 2014). The prediction of protein complexes is thus essential for studying cellular organization theory and understanding protein complex formation. Biologically, a protein complex is a group of proteins that interact at the same time and place. Detecting protein complexes through biological experiments is both costly and time-consuming. With the development of high-throughput experimental methods, many PPI networks have been produced, which usually exhibit small-world, scale-free, and modular characteristics. They can be formulated as graphs where nodes represent proteins and edges represent interactions.
Therefore, many computational algorithms present alternate ways to automatically discover protein complexes from the PPI networks. More details on the related work are introduced in the related work section.

1.1 Related Work

During the past decade, various computational methods have been presented to identify protein complexes in PPI networks. We briefly review the related work from three aspects: unsupervised learning-based methods, model optimization-based methods, and supervised learning-based methods.

1.1.1 Unsupervised Learning-Based Methods

Many researchers hypothesize that subgraphs with particular topological structures in PPI networks, such as dense, k-clique, and core-attachment structures, are factual protein complexes (Wang et al., 2010). Most of these methods use global heuristic search, local heuristic search, or both. Meanwhile, some methods integrate topological and biological information to further improve the accuracy of detecting protein complexes. Many local heuristic-based methods have been proposed to identify protein complexes. For instance, Altaf-Ul-Amin et al. (Altaf-Ul-Amin et al., 2006) developed DPClus, which generates clusters by ensuring density and checking the periphery of the clusters. Gavin et al. (Gavin et al., 2006) studied the organization of protein complexes, demonstrating that a protein complex generally contains a unique protein complex core and attachment proteins, called a core-attachment structure. Here, proteins in a protein complex core have relatively more reliable interactions among themselves. The attachment proteins are the surrounding proteins of the protein complex core that assist it in performing related functions (Lakizadeh et al., 2015). Wu et al. (Wu et al., 2009) proposed a classic protein complex discovery method (COACH) using the core-attachment structure. COACH first detects protein complex cores and then identifies their attachment proteins to form whole protein complexes. Peng et al. (Peng et al., 2014) designed a PageRank Nibble strategy to give adjacent proteins different probabilities with core-attachment structures and proposed WPNCA to predict protein complexes. Nepusz et al. (Nepusz et al., 2012) presented ClusterONE, which utilizes a greedy growth process to mine subgraphs with high cohesiveness that may be protein complexes. Recently, Wang et al. (Wang et al., 2020) presented a new graph clustering method using a local heuristic search strategy to detect static and dynamic protein complexes.
These local heuristic methods have strong local searchability, but finding an optimal global solution is difficult. Meanwhile, some global heuristic-based methods have been proposed to identify protein complexes. In 2009, Liu et al. (Liu et al., 2009) used an iterative method to weight PPI networks and developed a maximal clique-based method (CMC) to discover protein complexes from weighted PPI networks. Wang et al. (Wang et al., 2012) were inspired by the hierarchical organization of GO annotations and known protein complexes. They proposed OH-PIN, which is based on the concepts of overlapping M-clusters, λ-modules, and clustering coefficients to detect both overlapping and hierarchical protein complexes in PPI networks. PC2P (Omranian et al., 2021) is a parameter-free greedy approximation algorithm that casts the problem of protein complex detection as partitioning a network into biclique-spanned subgraphs, which include both sparse and dense subgraphs. Although these global heuristic search methods have strong global search ability, they require considerable time and computing resources. Recently, some methods based on network embedding strategies have been used to detect protein complexes. DPC-HCNE (Meng et al., 2019) is a novel protein complex detection method based on hierarchical compressing network embedding and core-attachment structures. It can preserve both the local and global topological information of a PPI network. CPredictor 5.0 (Yao et al., 2019) uses the network embedding method Node2Vec (Grover and Leskovec, 2016) to learn node feature vector representations and then calculates the node embedding similarity and the functional similarity between interacting proteins to construct weighted PPI networks. These methods illustrate that employing network embedding can improve the accuracy of protein complex identification.
It is well known that PPI networks contain many false-positive and false-negative interactions, i.e., noise. To overcome the noise in PPI networks, some studies exploit biological information, such as gene expression data (Keretsu and Sarmah, 2016), gene ontology (GO) data (Wang et al., 2019; Yao et al., 2019), and subcellular localization data (Lei et al., 2018), to complement the interactions in PPI networks. CPredictor2.0 (Xu et al., 2017) effectively detects protein complexes from PPI networks: it first groups proteins based on functional annotations and then applies the MCL algorithm to detect dense clusters as protein complexes. Zhang et al. (Zhang et al., 2016) calculated the active time point and the active probability of each protein, constructed dynamic PPI networks, and then proposed a novel method based on the core-attachment structure. Zhang et al. (Zhang et al., 2019) proposed a novel method based on the core-attachment structure and a seed expansion strategy to identify protein complexes using the topological structure and biological data in static PPI networks. ICJointLE (Zhang et al., 2019) is a novel method to identify protein complexes with the features of joint colocalization and joint coexpression in static PPI networks. NNP (Zhang et al., 2021) is a new method for recognizing protein complexes by topological and biological characteristics. Some methods (Zaki et al., 2013; Wang et al., 2019) use topological information to weight interactions in PPI networks. For example, PEWCC (Zaki et al., 2013) is a novel graph mining method that first assesses the reliability of the interactions and then detects protein complexes based on the concept of the weighted clustering coefficient. These methods have shown that the accuracy of protein complex identification can be significantly improved by integrating network topological structure and multiple sources of biological information.

1.1.2 Model Optimization-Based Methods

Several recent methods suggest that identifying protein complexes or community structures can be cast as an optimization problem using network topology and protein attributes. For example, RNSC (King et al., 2004) attempts to find an optimal set of partitions of a PPI network graph by employing different cost functions for detecting protein complexes. RSGNM (Zhang et al., 2012) is a regularized sparse generative network model that adds another process that generates propensities into an existing generative network model for protein complex identification. EGCPI (He and Chan, 2016) formulates the problem as an optimization problem to mine the optimal clusters with densely connected vertices in PPI networks to discover protein complexes. DPCA (Hu et al., 2018) formulates the problem of detecting protein complexes as a constrained optimization problem according to protein complexes' topological and biological properties; in particular, it is an algorithm with high efficiency and effectiveness. GMFTP (Zhang et al., 2014) is a generative model that simulates the generative processes of topological and biological information, and clusters that maximize the likelihood of generating the given PPI network are considered protein complexes. DCAFP (Hu and Chan, 2015) transforms the problem of identifying protein complexes into a constrained optimization problem and introduces an optimization model that integrates functional preferences and dense structures. He et al. (He et al., 2019) introduced a novel graph clustering model called contextual correlation preserving multiview featured graph clustering (CCPMVFGC) for discovering communities in graphs with multiview features, viewwise correlations of pairwise features, and the graph topology. VVAMo (He et al., 2021a) is a novel matrix factorization-based model for community detection in complex networks.
It proposes a unified likelihood function for VVAMo and derives an alternating algorithm for learning the optimal parameters of the proposed model. In 2017, Zhang et al. (Zhang et al., 2017) proposed a new firefly clustering algorithm that transforms the protein complex detection problem into an optimization problem. IMA (Wang et al., 2021) is a novel improved memetic algorithm that optimizes a fitness function to detect protein complexes. These model optimization-based methods usually have many parameters and variables, and the parameter optimization process is time-consuming. Nevertheless, they demonstrate that transforming the identification of protein complexes into an optimization problem is a meaningful direction.

1.1.3 Supervised Learning-Based Methods

The methods mentioned above are either unsupervised learning-based or model optimization-based methods that identify protein complexes using predefined assumptions and fixed models. Unsupervised learning-based methods do not need to address practical problems such as feature extraction from known protein complexes, model selection, and model training. However, they cannot utilize the information of known protein complexes, and they neglect protein complexes with other topological structures, such as the 'star' and 'spoke' modes. Generally, supervised learning-based methods first train a supervised learning model by extracting features, and the trained models are then used to search for new protein complexes. Many standard protein complex datasets have been compiled in recent years; therefore, several supervised learning-based methods that train regression or classification models have been proposed to discover protein complexes from PPI networks. For example, Qi et al. (Qi et al., 2008) proposed a framework to learn the parameters of a Bayesian network model for discovering protein complexes. Yu et al. (Yu et al., 2014) presented a supervised learning-based method that uses cliques as initial clusters and a trained linear regression model to form protein complexes. Shi et al. (Shi et al., 2011) proposed a semisupervised algorithm that trains a neural network model to detect protein complexes. ClusterEPs (Liu et al., 2016) estimates the possibility of a subgraph being a protein complex by emerging patterns (EPs). Dong et al. (Dong et al., 2018) provided the ClusterSS method, which integrates a trained neural network model and a local cohesiveness function to guide the search strategy for identifying protein complexes. Liu et al. (Liu et al., 2018) proposed a supervised learning method based on network embeddings and a random forest model for discovering protein complexes.
Based on decision trees, Sikandar et al. (Sikandar et al., 2018) presented a method using biological and topological information to detect protein complexes. Liu et al. (Liu et al., 2021) proposed a novel semisupervised model and a protein complex detection algorithm to identify significant protein complexes with clear module structures from PPI networks. Mei et al. (Mei, 2022) proposed a computational method that combines supervised learning and dense subgraph discovery to predict protein complexes. On the one hand, the accuracy of detection methods based on semisupervised or supervised learning is limited by small training datasets. On the other hand, these methods train only a single type of learning model, so the models are not very generalizable and their learning ability has certain limitations. Some existing studies show that graph neural network (GNN) methods can effectively learn graph structure and node features. For example, Kipf et al. (Kipf and Welling, 2016) presented a scalable approach for semisupervised learning on graph-structured data: the proposed graph convolutional network (GCN) model is based on an efficient variant of convolutional neural networks and can encode both graph structure and node features in a way useful for semisupervised classification. In 2021, Zaki et al. (Zaki et al., 2021) introduced various GCN approaches to improve the detection of protein complexes. Graph attention networks (GATs) aggregate neighbor nodes through an attention mechanism, realizing adaptive allocation of weights to different neighbors and thus greatly improving the expressive ability of GNN models. He et al. (He et al., 2021b) proposed a class of novel learning-to-attend strategies, named conjoint attentions (CAs), to construct graph conjoint attention networks (CATs) for GNNs.
CAs offer flexible incorporation of layerwise node features and structural interventions that can be learned outside the GNNs to compute appropriate weights for feature aggregation. We will study the detection of protein complexes in PPI networks using GATs in the future.

1.2 Observations and Contributions

Based on the related work, assigning weights to interacting edges via a network embedding method and multiple sources of biological information can effectively improve the accuracy of detection methods. Meanwhile, some studies have shown that protein complexes have core-attachment structures. Therefore, ELF-DPC is based on the core-attachment structure, and we first construct a weighted PPI network. Second, we propose a protein complex core mining strategy to mine local protein complex cores, and we identify global protein complex cores using the CPredictor2.0 method, which endows ELF-DPC with both global and local search ability. Third, most current methods are based on either unsupervised or supervised learning. Unsupervised learning-based methods can detect only one or a few types of topological protein complexes and cannot fully learn the characteristics of known protein complexes. Supervised learning-based methods can learn the characteristics of known protein complexes and detect protein complexes with different topological structures; still, current supervised learning-based methods train a single base model, and the generalization of a single model is poor. Therefore, we propose an ensemble learning model, consisting of a voting regression model trained on different types of base regression models together with structural modularity, to detect protein complexes with different topological structures. Finally, we propose a graph heuristic search strategy to extend each protein complex core into a protein complex. The results show that ELF-DPC attains superior performance over 12 state-of-the-art methods. Furthermore, GO enrichment analysis shows that the complexes detected by ELF-DPC have high biological relevance.
To summarize, we make the following contributions:

• We introduce a protein complex core mining strategy based on the core-attachment structure and design a graph heuristic search strategy to search for protein complexes.
• We propose structural modularity to describe the inherent topological organization of protein complexes.
• We present some new topological features and design an ensemble learning model by combining structural modularity and a voting regression model, which quantifies the possibility that a cluster is a protein complex.
• We present an ensemble learning framework to identify protein complexes, and it achieves better performance than other competing methods.

The rest of this study is organized as follows. The Materials and methods section introduces the datasets, terminologies, and methods. The Experiments and results section describes evaluation metrics and parameter selection and compares ELF-DPC with the competing methods. Finally, the Conclusion section provides a conclusion and future work.

2 Materials and Methods

2.1 Datasets

2.1.1 Protein-Protein Interaction Networks

In this paper, we used four PPI networks for the experiments: Gavin (Gavin et al., 2006), Krogan core (Krogan et al., 2006), DIP (Xenarios et al., 2002), and MIPS (Güldener et al., 2006). The detailed properties of these PPI networks are shown in Table 1. Self-interactions and duplicate interactions were eliminated.
TABLE 1

The detailed properties of the protein-protein interaction datasets.

Dataset        Number of nodes   Number of edges   Density
Gavin          1,855             7,669             0.004459796985
Krogan core    2,674             7,075             0.001979684934
DIP            4,930             17,201            0.00141572191241
MIPS           4,553             12,318            0.00118869460527

2.1.2 Standard Protein Complexes

We used two standard protein complex datasets constructed in the literature (Wang et al., 2020). Their properties are shown in Table 2. Standard protein complexes 1 consists of the known protein complexes from MIPS (Mewes et al., 2004), SGD (Hong et al., 2007), TAP06 (Gavin et al., 2006), ALOY (Aloy et al., 2004), CYC2008 (Pu et al., 2009), and NEWMIPS (Friedel et al., 2009). Standard protein complexes 2 is also a combined protein complex dataset (Ma et al., 2017); it consists of the Wodak database (Pu et al., 2009), PINdb, and GO complexes (Ma et al., 2017).
TABLE 2

The properties of the standard protein complexes.

Dataset                        Number   Protein coverage   Avg size
Standard protein complexes 1   812      2,773              8.92
Standard protein complexes 2   1,045    2,778              8.97

2.1.3 GO Annotation Data and Gene Expression Data

In this study, we used GO-slim data to describe the functional similarity of interactions, available at https://downloads.yeastgenome.org. The gene expression data were obtained from https://www.ncbi.nlm.nih.gov/sites/GDSbrowser, and the subcellular localization data from https://compartments.jensenlab.org/Downloads.

2.2 Terminologies

Here, we give some terminologies used in this paper. A PPI network is generally described as a weighted graph G = (V, E, W), where V is the set of proteins, E is the set of interactions, and W is an n × n (n = |V|) matrix that represents the reliability of protein pairs in the PPI network. The set of direct interacting neighbors of a node v is defined as N(v) = {u | (u, v) ∈ E, u ∈ V}.

2.3 Methods

2.3.1 The Framework of ELF-DPC Algorithm

ELF-DPC is a novel ensemble learning framework to identify protein complexes from PPI networks. The block diagram of the detection process is shown in Figure 1.
FIGURE 1

The ensemble framework of proposed protein complex detection.

The framework of this method is outlined in Algorithm 1. The input to the algorithm is a PPI network, and the output is a set of protein complexes. Our algorithm consists of five main steps. The first step constructs a weighted PPI network by combining topological structure, gene expression data, GO annotation data, and subcellular location data in Line 2 (Constructing a weighted PPI network section). The second step applies a protein complex core mining strategy to identify protein complex cores in the PPI network (Mining protein complex cores section) in Line 3. The third step first constructs feature vectors to describe the properties of known and false protein complexes in the PPI networks and trains a voting regression model (Training a voting regression model section) to represent protein complexes based on supervised learning in Line 5; it then defines a quality function called structural modularity to describe the inherent topological organization of protein complexes and combines the trained voting regression model with structural modularity to obtain an ensemble learning model in Line 6. In the fourth step, based on the ensemble learning model, a graph heuristic search strategy (Forming protein complexes section) extends each protein complex core into a protein complex in Lines 7–14. Finally, redundant identified protein complexes are removed in Line 15.

2.3.2 Constructing a Weighted PPI Network

Some studies have confirmed that the performance of protein complex detection can be markedly enhanced when edge weights are considered (Keretsu and Sarmah, 2016; Lei et al., 2018). Meanwhile, integrating multiple data sources into a PPI network can strengthen its reliability (Lei et al., 2018; Wang et al., 2020), which gives us confidence to weight the interactions. Moreover, a protein complex consists of proteins and the interactions among them, and the proteins in the same protein complex are coexpressed and have similar function and localization. Thus, we integrate multiple kinds of information, including gene expression data, protein localization data, and gene ontology data, to weight the interactions within the PPI networks.

2.3.2.1 Protein Coexpression Similarity

Generally, for a pair of interacting proteins, their coexpression level can reflect the strength of their interaction. Proteins with coexpressed relationships may also have similar functions (Eisen et al., 1998) and show stronger consistency of functions (Chen and Xu, 2004). Some studies have shown that coexpressed protein pairs tend to interact in the same protein complexes (Keretsu and Sarmah, 2016). The Pearson correlation coefficient (PCC) is used to estimate how strongly two interacting proteins are coexpressed (Lei et al., 2016; Shang et al., 2016). For a pair of proteins X and Y, their gene expression profiles over n time points are X = {x_1, x_2, …, x_i, …, x_n} and Y = {y_1, y_2, …, y_i, …, y_n}, respectively. Their PCC is defined as Eq. 1 (Wang et al., 2013):

PCC(X, Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}    (1)

where \bar{x} and \bar{y} are the average gene expression of proteins X and Y over the n time points, respectively. The value of PCC(X, Y) ranges from −1 to 1. For convenience, we use (PCC(X, Y) + 1)/2 in place of PCC(X, Y), which maps its value into (0, 1). The higher the value of PCC(X, Y), the larger the coexpression probability of X and Y, and the more likely they belong to the same protein complex.
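The rescaled PCC of Eq. 1 can be sketched in a few lines of NumPy; `pcc_weight` below is a hypothetical helper name, not from the paper's code.

```python
import numpy as np

def pcc_weight(x, y):
    """Pearson correlation of two gene expression profiles over the same
    n time points, rescaled from [-1, 1] to (0, 1) as described in the text."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # center each profile
    pcc = (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))
    return (pcc + 1.0) / 2.0                     # (PCC + 1) / 2
```

Perfectly coexpressed profiles score 1.0 and perfectly anti-correlated profiles score 0.0 under this rescaling.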

2.3.2.2 Protein Functional Similarity

From a functional standpoint, we use GO-slim data to reflect the functional similarity of proteins. If a pair of proteins have more common GO-slim annotations, they are more likely to have the same biological function, and the interaction between them is more reliable. Here, we let FS(X, Y) describe this relationship, which is defined as Eq. 2, where |FS(X)| and |FS(Y)| represent the number of GO-slim annotations for proteins X and Y, respectively, and |FS(X) ∩ FS(Y)| denotes the number of common GO-slim annotations for proteins X and Y.

2.3.2.3 Protein Subcellular Location Similarity

Generally, if two interacting proteins share more common subcellular locations, the interaction between them is more reliable. Here, we define the subcellular location similarity SL(X, Y) as Eq. 3, where |SL(X)| and |SL(Y)| denote the number of subcellular localizations of proteins X and Y, respectively, and |SL(X) ∩ SL(Y)| represents the number of common subcellular localizations between proteins X and Y.
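Both FS (Eq. 2) and SL (Eq. 3) are annotation-set overlap measures built from the counts named above. The exact published formulas are not reproduced in this extract; the squared-overlap form |A ∩ B|² / (|A|·|B|), common in this literature, is used below as an assumption, and `overlap_similarity` is a hypothetical helper name.

```python
def overlap_similarity(annots_x, annots_y):
    """Set-overlap similarity for two annotation sets (GO-slim terms for FS,
    subcellular locations for SL). The squared-overlap form is an assumption;
    consult Eqs. 2-3 of the paper for the published variant."""
    a, b = set(annots_x), set(annots_y)
    if not a or not b:
        return 0.0                       # no annotations, no evidence
    common = len(a & b)                  # |A ∩ B|
    return common ** 2 / (len(a) * len(b))
```

Identical annotation sets score 1.0, disjoint sets score 0.0, and partial overlap falls in between.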

2.3.2.4 Protein Topological Structure Similarity

The network embedding method is a representation learning technique for representing a network's nodes, which can automatically learn topological information from PPI networks. In this study, we use the network embedding method Node2Vec (Grover and Leskovec, 2016) to learn low-dimensional feature representations of the structural information of the proteins in a PPI network. For proteins X and Y, their representations are two vectors. The protein embedding vectors obtained by Node2Vec can reflect the topological structure similarity among proteins, and we use cosine similarity to measure the similarity of the vector representations of proteins X and Y, which is defined as Eq. 4:

TSS(X, Y) = \frac{F(X) \cdot F(Y)}{\|F(X)\|\,\|F(Y)\|}    (4)

where F(X) = (x_1, x_2, …, x_i, …, x_n) and F(Y) = (y_1, y_2, …, y_i, …, y_n) are the n-dimensional embedding vectors of the two proteins. TSS(X, Y) indicates the topological structure similarity of two connected proteins X and Y. For each edge, its weighted value W(X, Y) is expressed by Eq. 5. Edges whose weight is 0 are regarded as noise and removed from the PPI networks. Finally, we integrate topological structure similarity and biological information similarity, which enhances the reliability of the PPI networks; a weighted PPI network is thereby constructed.

2.3.3 Mining Protein Complex Cores

According to the Constructing a weighted PPI network section, interactions are weighted using multiple biological properties and the topological structure, so the higher an edge's weight, the more likely its two terminal proteins are inside the same protein complex (Wang et al., 2011; Li et al., 2012). Furthermore, protein complex cores often correspond to dense subgraphs in PPI networks (Wu et al., 2009; Wang et al., 2019). The pseudocode of mining protein complex cores is presented in Algorithm 2. First, for an edge (v, u), its weight is w(v, u), and its neighborhood graph is denoted NG(v, u) = (V*, E*, W*), where V* = N(v) ∪ N(u) ∪ {v, u}. The average weighted degree of NG(v, u) is denoted AWD(NG(v, u)) (Eq. 6). Based on the analysis above, we propose a score function (Eq. 7) that scores seed edges based on the edge weight w(v, u) and the average weighted degree of the edge's neighborhood graph (Eq. 6) to select seed edges in Line 1. Then, we sort all edges in nonascending order of the score function (Eq. 7); only edges whose score exceeds the mean score of all edges are queued into Q. Seed edges in Q are used to mine protein complex cores in Line 2. For an edge (v, u) ∈ E, its edge clustering coefficient ECC(v, u) is defined as the number of triangles to which (v, u) belongs, divided by the number of triangles that might potentially include (v, u), as shown in Eq. 8, where Z(v, u) denotes the number of triangles built on edge (v, u), and min(|deg(v)|, |deg(u)|) is the minimum degree of the two terminal proteins. Initially, the edge with the highest weight is selected as the first seed edge (v, u), and a protein complex core is created in Line 6; neighbors are added to the core when both the edge weight w(x, t) ≥ Avgedgesweight (defined in Eq. 9) and ECC(x, t) is greater than the average edge clustering coefficient of all edges (AvgweightECC), according to the closeness between the seed edge (v, u) and its neighbors, in Lines 9–17. These two constraints ensure that the proteins in a protein complex core are biologically correlated and closely connected in topological structure. A protein complex core is retained if it contains two or more proteins in Lines 18–20. Meanwhile, the seed edge (including its two terminal proteins) is marked and cannot be used as the seed edge of another cluster in Lines 7 and 8. We select the next highest-weight edge whose two terminal proteins are not included in previous seed edges and use it to form the next protein complex core, until the seed queue Q is empty, in Lines 6–22. CPredictor2.0 (Xu et al., 2017) is also employed to detect global protein complex cores. CPredictor2.0 detects protein complexes using MCL and protein functional information: it first discovers clusters in each functional group using the Markov clustering algorithm and then merges clusters with higher overlap. We use CPredictor2.0 to obtain global protein complex cores (CPrclusters) in Line 23. Next, we combine the local protein complex cores obtained by the graph heuristic search with the global protein complex cores obtained by CPredictor2.0 in Line 24. Algorithm 2 may identify some redundant protein complex cores; for these, we keep only one copy in the list of protein complex cores in Line 25.
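The edge clustering coefficient of Eq. 8 can be sketched over a plain adjacency-set representation. Following the description in the text, the denominator below is min(deg(v), deg(u)); note that some papers instead use min(deg(v)−1, deg(u)−1), so this is an assumption about the exact variant. The function name `ecc` mirrors the paper's notation.

```python
def ecc(adj, v, u):
    """Edge clustering coefficient ECC(v, u) (Eq. 8).

    adj: dict mapping each node to the set of its neighbors.
    Z(v, u) = number of triangles built on edge (v, u)
            = number of common neighbors of v and u.
    """
    z = len(adj[v] & adj[u])                    # triangles on (v, u)
    denom = min(len(adj[v]), len(adj[u]))       # min degree of endpoints
    return z / denom if denom else 0.0
```

In a triangle {0, 1, 2}, edge (0, 1) lies on one triangle and both endpoints have degree 2, so ECC = 0.5 under this variant.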

2.3.4 Obtaining an Ensemble Learning Model

2.3.4.1 Training a Voting Regression Model

To obtain the trained regression model, we follow several steps. First, we collect the known protein complexes and construct a weighted PPI network based on Eq. 5. Second, we map these known protein complexes onto the weighted and unweighted PPI networks to obtain mapped protein complexes. Third, we generate false protein complexes in the current weighted and unweighted PPI networks following the same size distribution as the mapped protein complexes, and we analyze the topological properties of known and false protein complexes. Fourth, we extract and select topological features from the mapped and false protein complexes. Fifth, we choose an appropriate regression model and train it. Finally, we obtain the trained regression model. The whole training routine is illustrated in Figure 2.
FIGURE 2

A process of training a regression model.

Next, we mainly introduce the differences between this study and previous research works, and our contributions. Obtaining known protein complexes from the databases of standard protein complexes 1 and 2 (Wang et al., 2020) is very important because they are used as factual protein complexes for training a model. Note that each such protein complex has three or more proteins. In machine learning, the quality of the training dataset is vital to model training. Previous methods generally construct false protein complexes by randomly selecting nodes in the graph, which has two disadvantages: it does not guarantee that the generated subgraphs are connected, and the generated subgraphs cannot reflect the true topology of subgraphs in PPI networks. Therefore, we propose a strategy for generating false protein complexes. First, standard protein complexes are mapped onto the PPI networks. Note that some standard protein complexes cannot be mapped onto the PPI networks, so the number of mapped protein complexes is generally smaller than the number of standard protein complexes. Second, we analyze the size distribution of the mapped protein complexes, and the generated false protein complexes follow the same power-law size distribution. Third, according to this size distribution, we generate false protein complexes by randomly selecting local neighborhood subgraphs in the PPI networks. Only false protein complexes whose neighborhood affinity NA(A, B) (Eq. 15) with every known protein complex is less than 0.2 are retained. Finally, the ratio between the number of false protein complexes and the number of mapped protein complexes is 5 to 1; for the selection of the parameter ratio, please see the parameter selection section. In this paper, both known and false protein complexes in the PPI networks are modeled as weighted and unweighted undirected graphs, with weights calculated based on Eq. 5.
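A minimal sketch of the false-complex generating strategy. The exact sampling procedure is not spelled out in the text, so the sketch assumes a simple random local-neighborhood expansion from a random seed node; `na` implements the standard neighborhood affinity of Eq. 15, and candidates overlapping any known complex with NA ≥ 0.2 are rejected.

```python
import random

def na(a, b):
    """Neighborhood affinity NA(A, B) = |A & B|^2 / (|A| * |B|) (Eq. 15)."""
    inter = len(a & b)
    return inter * inter / (len(a) * len(b))

def random_connected_subgraph(adj, size, rng):
    """Grow a connected subgraph of the target size from a random seed node
    (assumed stand-in for 'randomly selecting local neighborhood subgraphs')."""
    start = rng.choice(sorted(adj))
    nodes, frontier = {start}, set(adj[start])
    while len(nodes) < size and frontier:
        nxt = rng.choice(sorted(frontier))
        nodes.add(nxt)
        frontier = (frontier | adj[nxt]) - nodes
    return nodes

def generate_false_complexes(adj, known, sizes, ratio, rng):
    """Generate ratio * len(sizes) false complexes whose NA with every
    known complex is below 0.2, following the given size distribution."""
    out = []
    while len(out) < ratio * len(sizes):
        cand = random_connected_subgraph(adj, rng.choice(sizes), rng)
        if all(na(cand, k) < 0.2 for k in known):
            out.append(cand)
    return out

# Toy path network 0-1-...-9 with one known complex (assumed data).
adj = {i: set() for i in range(10)}
for i in range(9):
    adj[i].add(i + 1)
    adj[i + 1].add(i)
known = [{0, 1, 2}]
falses = generate_false_complexes(adj, known, sizes=[3, 3], ratio=3,
                                  rng=random.Random(42))
```

Every generated subgraph is connected by construction, addressing the first disadvantage of node-level random sampling noted above.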
Extracting and selecting appropriate features is essential to distinguish between factual and false protein complexes. Previous supervised learning methods rely on finding cliques, triangles, rectangles, spokes, and star graphs to mine protein complexes in PPI networks, and topological features such as degree statistics, node size, and edge statistics can also be used. On the one hand, we use some existing topological features for protein complex identification; on the other hand, we propose some new topological features to describe the topological properties of protein complexes. In total, we use 65 topological features to represent protein complexes in the PPI networks; Table 3 presents the list. Some features are extracted from the unweighted PPI network and some from the weighted one. The implementation details of these topological features are described in https://github.com/RongquanWang/ELF-DPC/Methods/Feature_selection.py. Additionally, if readers discover other relevant and valid topological features, these can be added to further enrich the representation of protein complexes.
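As an illustration, a handful of the simpler features listed in Table 3 (node size, edge size, graph density, degree statistics, mean edge weight) can be computed as follows; this is only a sketch, and the authors' repository contains the full 65-feature extractor.

```python
import statistics

def basic_features(nodes, wedges):
    """Compute a small subset of the Table 3 features for a candidate
    subgraph given its node set and weighted edge dict {(u, v): w}."""
    n, m = len(nodes), len(wedges)
    deg = {v: 0 for v in nodes}
    for (a, b) in wedges:
        deg[a] += 1
        deg[b] += 1
    degs = list(deg.values())
    weights = list(wedges.values())
    return {
        "node_size": n,
        "edge_size": m,
        "graph_density": 2 * m / (n * (n - 1)) if n > 1 else 0.0,
        "max_degree": max(degs),
        "mean_degree": statistics.mean(degs),
        "edge_mean_weight": statistics.mean(weights),
    }

# Toy candidate complex: a weighted triangle (assumed data).
feats = basic_features({"a", "b", "c"},
                       {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.7})
```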
TABLE 3

The topological features used for representing protein complexes.

Num  Feature name                      Num  Feature name
1    Graph entropy                     2    Graph weight entropy
3    Node size                         4    Edge size
5    Graph clustering coefficient      6    Maximum degree
7    Minimum degree                    8    Mean degree
9    Median degree                     10   Variance degree
11   Standard deviation degree         12   Maximum weight degree
13   Minimum weight degree             14   Average weight degree
15   Median weight degree              16   Standard weight degree
17   Graph density                     18   Graph weight density
19   Edge mean weight                  20   Edge median weight
21   Edge variance weight              22   Edge standard weight
23   Average shortest path length      24   Graph diameter
25   Maximum clustering coefficient    26   Minimum clustering coefficient
27   Mean clustering coefficient       28   Median clustering coefficient
29   Variance clustering coefficient   30   Graph conductance
31   Graph weight conductance          32   Modularity score
33   Weight modularity score           34   Average boundary edge weight
35   Average edge modularity           36   Average common neighbor
37   Standard common neighbor          38   Variance common neighbor
39   Minimum common neighbor           40   Median common neighbor
41   Maximum common neighbor           42   Mean topological feature
43   Median topological feature        44   Variance topological feature
45   Maximum topological feature       46   Minimum topological feature
47   Standard topological feature      48   Mean degree correlation
49   Minimum degree correlation        50   Variance degree correlation
51   Maximum degree correlation        52   Median degree correlation
53   Community model                   54   Weight community model
55   Topological change 1              56   Topological change 2
57   Topological change 3              58   Topological change 4
59   Topological change 5              60   Topological change 6
61   Topological change 7              62   Topological change 8
63   First eigenvalues 1               64   First eigenvalues 2
65   First eigenvalues 3
Ensemble learning combines multiple individual learners under certain strategies to form a learning committee, so that the overall generalization performance is greatly improved. In general, the generalization capability of an ensemble model is much greater than that of a single model. Following the barrel (weakest-link) principle, we focus on two major criteria:
• Accuracy: each individual learner must be reasonably accurate, not too weak.
• Diversity: the outputs of the individual learners should differ from each other.
Therefore, producing and combining "good but different" individual learners is the core of ensemble learning. The VotingRegressor model is one of the most efficient ensemble learning techniques for reducing variance and improving detection accuracy. In this paper, we train a VotingRegressor built on several base models. A VotingRegressor is an ensemble meta-estimator that fits several base estimators and averages their individual predictions to form a final prediction. Here, LinearRegression, BayesianRidge, DecisionTreeRegressor, and SVR (kernel = "linear") are used as the base estimators. We select the VotingRegressor model because it reduces the variance of the individual base estimators, generalizes better, and is more robust than any single estimator. In this study, the VotingRegressor model and the base estimators use default parameters. These models are freely available in scikit-learn (Pedregosa et al., 2011); see https://scikit-learn.org/stable/supervised_learning.html#supervised-learning.
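The training step maps directly onto scikit-learn. The sketch below uses synthetic stand-in features and labels (the real inputs are the 65-dimensional vectors of Table 3, with known complexes labeled positive and false complexes negative), but the estimator composition matches the text.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, BayesianRidge
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the feature matrix and labels (assumed data).
rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X.sum(axis=1) > 2.5).astype(float)

# The four base estimators named in the text, combined by averaging.
model = VotingRegressor([
    ("lr", LinearRegression()),
    ("br", BayesianRidge()),
    ("dt", DecisionTreeRegressor(random_state=0)),
    ("svr", SVR(kernel="linear")),
])
model.fit(X, y)
scores = model.predict(X)  # averaged prediction of the four base estimators
```

A candidate subgraph's feature vector can then be scored with `model.predict`, yielding the supervised component of the ensemble learning model.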
As a result, the trained VotingRegressor model can estimate the probability that a subgraph is a natural protein complex from a supervised learning perspective, allowing detection of protein complexes with various topological structures. A higher VotingRegressor score indicates a higher probability that the subgraph is an actual protein complex. The VotingRegressor is defined in Eq. 10a and Eq. 10b.

2.3.4.2 The Structural Modularity of Protein Complexes

Based on the within-module and between-module edges of a subgraph and the size of the subgraph, we present a new formal definition of protein complexes in PPI networks (Wu et al., 2009; Yu et al., 2011; Nepusz et al., 2012; Wang et al., 2019). Given this new module definition, we introduce an effective quantitative measure, structural modularity (SM), to estimate the likelihood that a cluster C = (V_C, E_C, W_C) is a protein complex; it can detect both dense and sparse protein complexes in PPI networks. SM combines Cohesion(C), defined in Eq. 11, and Coupling(C), defined in Eq. 12, where W_in(C) denotes the total weight of the internal edges contained entirely in cluster C, and |C| is the number of nodes in C. Cohesion(C) estimates how densely a community-structured protein complex connects its nodes: it takes the density of cluster C multiplied by the square root of the size of C to quantify the likelihood that the cluster is a protein complex. The idea is that a protein complex in a PPI network is often relatively sparse, so adapting the density in this way may be more appropriate than using raw density as the quality function. W_out(C) = ∑ w(v, u) over edges with exactly one endpoint in C represents the total weight of the boundary edges connecting cluster C with the rest of the PPI network, and it measures how sparsely cluster C connects to its neighboring nodes. Finally, structural modularity is calculated as Eq. 13. A protein complex is assigned a higher SM(C) value when it has suitably high density and is well separated from the rest of the network; thus SM(C) identifies protein complexes with both cohesion and separation topological properties.
This reflects the fact that proteins in a protein complex display dense, frequent connections within the complex and weak, rare connections to proteins outside it.
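Because Eqs. 11–13 are not reproduced in this text, the sketch below uses assumed concrete forms: Cohesion as weighted density scaled by sqrt(|C|), Coupling as the internal-to-total weight ratio (high when boundary weight is small), and a simple additive combination. The paper's exact formulas may differ; this only illustrates the cohesion-plus-separation idea.

```python
import math

def structural_modularity(cluster, wedges):
    """Assumed sketch of SM(C): Cohesion adapts weighted density by a
    sqrt(|C|) factor (Eq. 11, assumed form); Coupling rewards clusters
    whose boundary weight W_out(C) is small relative to W_in(C)
    (Eq. 12, assumed form); the combination (Eq. 13) is assumed additive."""
    c = set(cluster)
    w_in = sum(w for (a, b), w in wedges.items() if a in c and b in c)
    w_out = sum(w for (a, b), w in wedges.items() if (a in c) != (b in c))
    n = len(c)
    cohesion = (2 * w_in / (n * (n - 1))) * math.sqrt(n)
    coupling = w_in / (w_in + w_out) if (w_in + w_out) else 0.0
    return cohesion + coupling

# Toy weighted network: a triangle {a, b, c} with one boundary edge to d.
wedges = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.7, ("c", "d"): 0.2}
sm = structural_modularity({"a", "b", "c"}, wedges)
```

The dense, well-separated triangle scores highly under both terms, matching the intended behavior of SM(C).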

2.3.4.3 Building an Ensemble Learning Model

In this paper, we propose an ensemble learning model that combines the VotingRegressor model and structural modularity (SM) to quantify the likelihood that a cluster C = (V_C, E_C, W_C) is a candidate protein complex, and we use it to guide the protein complex identification process. An ensemble learning model can improve the robustness and stability of the clustering results by combining the outputs of several models, thus improving overall accuracy. For a cluster C, the ensemble learning model is defined in Eq. 14. Based on this model, we next introduce a graph heuristic search strategy that uses the ensemble learning model to form protein complexes.

2.3.5 Forming Protein Complexes

Based on the fact that a protein complex consists of a protein complex core plus attachment proteins, we first obtain protein complex cores, then extract the attachment proteins of each core and select reliable attachments to cooperate with the core in forming a protein complex. We design a graph heuristic search strategy that extends each protein complex core into a whole protein complex: starting from a core, it iteratively inserts neighboring proteins into the core and removes proteins from it, searching for a locally optimal cluster. Each protein complex core is processed by the graph heuristic search strategy together with the ensemble learning model, and the basic idea is to iteratively extend and correct each core so as to maximize the score of the ensemble learning model (see the Obtaining an ensemble learning model section). The pseudocode of the graph heuristic search strategy is shown in Algorithm 3, which consists of the following steps:
i) Input a protein complex core.
ii) Adding outer boundary proteins in Lines 3–12. First, for the current protein complex core, we construct its outer boundary protein set: we obtain all directly connected neighbor proteins of the core and rank these neighbors by the number of shared proteins between each neighbor's neighborhood and the current core. We discard neighbors with fewer than two common proteins to keep high-quality candidates, and then retain only the top half of the ranked neighbor set as the outer boundary protein set in Line 3.
Second, we calculate the ensemble learning model score of the current core when each outer boundary protein is temporarily added. The outer boundary protein that maximizes the ensemble learning model score is inserted into the core in Lines 5–11. This process repeats until the score of the core no longer increases or the outer boundary set is empty in Lines 10 and 4.
iii) Removing inner boundary proteins. First, the inner boundary proteins of the current core are the proteins that belong to the core and connect to at least one protein outside it in the PPI network in Line 16. Second, we calculate the ensemble learning model score after each inner boundary protein is temporarily removed from the core. An inner boundary protein whose removal increases the score is eliminated from the core in Lines 19–21. This process continues until the score of the core reaches a maximum, the inner boundary set is empty, or the size of the current core is less than or equal to 2 in Lines 22–23 and 17.
iv) We repeat ii) and iii) until the core no longer changes or Fitness(SG) of the core no longer increases in Lines 27–30; the current core is then considered a locally optimal cluster in Lines 2–31 and is output as a detected protein complex in Line 32.
Finally, we select the next protein complex core and repeat this graph heuristic search strategy (Algorithm 3) until no seed edges remain. In the last step of the algorithm, redundant protein complexes and complexes containing fewer than three proteins are discarded. Algorithm 3. A graph heuristic search strategy.
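A simplified sketch of the add/remove loop of Algorithm 3, with a generic `fitness` callable standing in for the ensemble learning model score of Eq. 14 (the real algorithm additionally applies the boundary-ranking and size rules described above, which are omitted here for brevity).

```python
def expand_core(core, adj, fitness):
    """Greedily extend and correct a protein complex core: add the
    outer-boundary protein that most increases fitness, then remove any
    inner protein whose removal increases fitness, until convergence."""
    cluster = set(core)
    improved = True
    while improved:
        improved = False
        # Adding phase: try each outer boundary protein.
        boundary = {u for v in cluster for u in adj[v]} - cluster
        best, base = None, fitness(cluster)
        for u in boundary:
            gain = fitness(cluster | {u}) - base
            if best is None or gain > best[1]:
                best = (u, gain)
        if best and best[1] > 0:
            cluster.add(best[0])
            improved = True
        # Removing phase: drop a protein if that raises the score.
        base = fitness(cluster)
        for u in list(cluster):
            if len(cluster) > 2 and fitness(cluster - {u}) > base:
                cluster.remove(u)
                improved = True
                break
    return cluster

# Toy network: triangle {a, b, c} with pendant d; fitness = average degree
# inside the cluster (a stand-in for the ensemble learning model score).
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
def avg_internal_degree(cluster):
    m = sum(len(adj[v] & cluster) for v in cluster) / 2
    return 2 * m / len(cluster)
result = expand_core({"a", "b"}, adj, avg_internal_degree)
```

Starting from the core {a, b}, the search absorbs c (which completes the triangle) but rejects the pendant d, since adding it does not raise the stand-in fitness.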

3 Experiments and Results

ELF-DPC was implemented in Python 3 and executed on a PC with an Intel i7-4790 CPU @3.60 GHz and 80 GB RAM.

3.1 Evaluation Metrics

In this study, we evaluate the proposed method by comparing its performance against competing methods using several statistical metrics. For this purpose, we used the neighborhood affinity, F-measure, CR, ACC, MMR, and Jaccard criteria to evaluate the protein complex detection algorithms. Let S denote the known protein complexes and D the protein complexes identified by a detection method.

3.1.1 Neighborhood Affinity

Let S_i be a standard protein complex in S and D_j a discovered protein complex in D. Their neighborhood affinity score NA(S_i, D_j) (Brohee and Van Helden, 2006) describes the similarity of the two protein complexes and is defined as Eq. 15: NA(S_i, D_j) = |S_i ∩ D_j|² / (|S_i| × |D_j|). Generally, if NA(S_i, D_j) is larger than or equal to 0.2, protein complexes S_i and D_j are regarded as matching protein complexes (Li et al., 2010).
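Eq. 15 is the standard neighborhood affinity and can be computed directly on sets of protein identifiers:

```python
def neighborhood_affinity(s, d):
    """NA(S_i, D_j) = |S_i & D_j|^2 / (|S_i| * |D_j|) (Eq. 15)."""
    overlap = len(s & d)
    return overlap ** 2 / (len(s) * len(d))

# Two complexes sharing 2 of their 3 proteins each (toy example):
na_value = neighborhood_affinity({"p1", "p2", "p3"}, {"p2", "p3", "p4"})
# 4/9 >= 0.2, so this pair counts as a match
```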

3.1.2 F-Measure

Let N_cs be the number of standard protein complexes that match at least one detected protein complex, i.e., N_cs = |{s | s ∈ S, ∃d ∈ D, NA(s, d) ≥ ω}|, and let N_cd be the number of detected protein complexes that match at least one standard protein complex, i.e., N_cd = |{d | d ∈ D, ∃s ∈ S, NA(d, s) ≥ ω}|, where ω is a predefined threshold, usually 0.20. Recall and precision are defined as recall = N_cs/|S| and precision = N_cd/|D|, respectively. Finally, the F-measure is the harmonic mean of precision and recall, defined by Eq. 16: F-measure = 2 × precision × recall / (precision + recall).
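A direct implementation of Eq. 16 on sets of protein identifiers:

```python
def f_measure(standard, detected, omega=0.2):
    """Precision = N_cd/|D|, recall = N_cs/|S|, F = 2PR/(P+R) (Eq. 16)."""
    def na(a, b):
        return len(a & b) ** 2 / (len(a) * len(b))
    n_cs = sum(any(na(s, d) >= omega for d in detected) for s in standard)
    n_cd = sum(any(na(d, s) >= omega for s in standard) for d in detected)
    recall = n_cs / len(standard)
    precision = n_cd / len(detected)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: one of two standard complexes is recovered exactly.
f = f_measure([{1, 2, 3}, {4, 5, 6}], [{1, 2, 3}, {7, 8, 9}])
```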

3.1.3 ACC

Let T_ij be the number of proteins shared by standard protein complex S_i and detected protein complex D_j, and let N_i be the number of proteins in standard protein complex S_i. The sensitivity and positive predictive value are calculated as Sn = Σ_i max_j T_ij / Σ_i N_i and PPV = Σ_j max_i T_ij / Σ_j Σ_i T_ij, respectively. The accuracy (ACC) is their geometric mean, defined by Eq. 17: ACC = sqrt(Sn × PPV).
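Sn, PPV, and ACC (Eq. 17) can be computed from the overlap matrix T as follows:

```python
import math

def accuracy(standard, detected):
    """ACC = sqrt(Sn * PPV) (Eq. 17), built from T_ij = |S_i & D_j|."""
    t = [[len(s & d) for d in detected] for s in standard]
    sn = sum(max(row) for row in t) / sum(len(s) for s in standard)
    col_max = [max(t[i][j] for i in range(len(standard)))
               for j in range(len(detected))]
    col_sum = [sum(t[i][j] for i in range(len(standard)))
               for j in range(len(detected))]
    ppv = sum(col_max) / sum(col_sum)
    return math.sqrt(sn * ppv)

# Toy example: two standard complexes, two partially matching detections.
acc = accuracy([{1, 2, 3}, {4, 5, 6}], [{1, 2}, {4, 5, 6, 7}])
```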

3.1.4 MMR

We used a third metric, the maximum matching ratio (MMR) (Nepusz et al., 2012), based on a maximal one-to-one mapping between standard and detected protein complexes. First, we construct a bipartite graph between S and D in which each standard protein complex S_i ∈ S and detected protein complex D_j ∈ D are connected by an edge weighted by NA(S_i, D_j). Next, we select disjoint edges from the bipartite graph so as to maximize the sum of their weights. Finally, the MMR is the sum of the weights of the selected edges divided by |S|, as denoted by Eq. 18.
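A brute-force sketch of the MMR (Eq. 18); real implementations use a maximum-weight bipartite matching algorithm rather than enumerating permutations, and for brevity the sketch assumes |S| ≤ |D|.

```python
from itertools import permutations

def mmr(standard, detected):
    """Maximum matching ratio: maximum-weight one-to-one matching in the
    NA-weighted bipartite graph, divided by |S| (Eq. 18).
    Brute force over permutations; assumes len(standard) <= len(detected)."""
    def na(a, b):
        return len(a & b) ** 2 / (len(a) * len(b))
    k = min(len(standard), len(detected))
    best = 0.0
    for ds in permutations(detected, k):
        best = max(best, sum(na(s, d) for s, d in zip(standard, ds)))
    return best / len(standard)

# Toy example: exact match for S_1, partial match for S_2.
m = mmr([{1, 2, 3}, {4, 5, 6}], [{1, 2, 3}, {4, 5}, {7, 8}])
```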

3.1.5 Coverage Rate

The coverage rate (CR) assesses how many proteins in the standard protein complexes are covered by the identified complexes. Given the standard protein complexes S and the detected protein complexes D, an |S| × |D| matrix T is constructed, where each element T_ij is the number of proteins shared by the ith standard complex and the jth detected complex. The coverage rate is calculated by Eq. 19: CR = Σ_i max_j T_ij / Σ_i N_i, where N_i is the number of proteins in the ith standard complex.

3.1.6 Jaccard

Jaccard is the final measure used to evaluate the clustering methods (Song and Singh, 2009). For a standard protein complex S_i ∈ S and a discovered protein complex D_j ∈ D, their Jaccard index is Jaccard(S_i, D_j) = |S_i ∩ D_j| / |S_i ∪ D_j|. For a discovered protein complex D_j, its Jaccard score is the maximum of Jaccard(S_i, D_j) over all S_i ∈ S; for a standard protein complex S_i, its Jaccard score is the maximum over all D_j ∈ D. Then, for the detected protein complexes D, JaccardD is the weighted average of the per-complex Jaccard scores; similarly, JaccardS is defined for the standard protein complexes S. Finally, the overall Jaccard is calculated from JaccardD and JaccardS by Eq. 20.

3.1.7 Functional Enrichment Analysis

In addition to these metrics, we investigated whether the identified protein complexes are biologically significant by calculating p-values. Generally, a detected protein complex is considered biologically significant if its p-value is less than 0.01. In this paper, we used the fast tool LAGO (Boyle et al., 2004) to compute p-values; it is based on the hypergeometric distribution with Bonferroni correction. For more information, please refer to the literature (Boyle et al., 2004; Wang et al., 2019). The p-value is given by Eq. 21, where C is the number of proteins in a discovered protein complex, k is the number of proteins from the functional group within that complex, F is the size of the functional group in the PPI network, and N is the number of proteins in the PPI network.
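The hypergeometric tail of Eq. 21 can be computed directly; LAGO additionally applies a Bonferroni correction by multiplying the result by the number of functional groups tested.

```python
from math import comb

def p_value(k, c, f, n):
    """Hypergeometric tail P(X >= k) (Eq. 21): probability that a random
    c-protein complex drawn from an n-protein network contains at least
    k proteins from a functional group of size f."""
    total = comb(n, c)
    return sum(comb(f, i) * comb(n - f, c - i)
               for i in range(k, min(c, f) + 1)) / total

# Toy example: a 4-protein complex with 3 hits from a 5-protein group
# in a 10-protein network.
p = p_value(3, 4, 5, 10)
```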

3.2 Parameter Selection

To study the effect of the parameter ratio on the performance of ELF-DPC, we varied ratio from 1 to 20 in increments of 5 over several experiments and set it to an appropriate value. Figures 3, 4 show how the Total score changes with ratio for ELF-DPC on the four PPI networks and the two sets of standard protein complexes. With standard protein complexes 1, the Total score reaches its maximum at ratio = 5; with standard protein complexes 2, it peaks at ratio = 15. The Total score is not very sensitive to ratio: it tends to be stable when ratio falls in (5, 15), and its fluctuations are not significant. Therefore, ratio is set to 5 as the default value in this study.
FIGURE 3

Value of the parameter ratio for ELF-DPC based on standard protein complexes 1.

FIGURE 4

Value of the parameter ratio for ELF-DPC based on standard protein complexes 2.


3.3 Comparison With State-of-the-art Algorithms

We obtained the software implementations for all the compared methods, and their parameters are shown in Table 4. Although better results could probably be obtained by fine-tuning these parameters, to maintain the fairness of different algorithms, the parameters of the compared algorithms and the ELF-DPC algorithm were set as the recommended values by the authors.
TABLE 4

Parameters of each method used in the study.

ID  Year  Algorithm      Parameters
1   2003  MCL            inflation = 2 (default setting)
2   2006  DPClus         d_in = 0.7, cp_in = 0.50 (author suggestions)
3   2009  CMC            min_deg_ratio = 1, min_size = 3, overlap_thres = 0.5, merge_thres = 0.25 (default setting)
4   2012  ClusterONE     Density = auto, Overlap threshold = 0.8 (author suggestions)
5   2013  PEWCC          Overlap = 0.8, -r = 0.1, Re-join = 0.3 (author suggestions)
6   2015  WPNCA          lambda = 0.3, size = 3 (author suggestions)
7   2016  CPredictor2.0  func_lvl = 6, Overlap threshold = 0.8, size = 3 (default setting)
8   2016  Zhang          Complex_thresh = 0.1 (author suggestions)
9   2017  ClusterEPs     NEPs of complexes (minimum support threshold = 0.4, maximum support threshold = 0.05); NEPs of non-complexes (maximum support threshold = 0.05, minimum support threshold = 0.4); maximum overlap = 0.9, maximum size of clusters = 100 (author suggestions)
10  2018  ClusterSS      numEpochs = 500, learnRate = 0.2, thresholdIn = 1.0, thresholdOut = 1.02, negativeTime = 20, minimum cluster size = 3 (author suggestions)
11  2019  ICJointLE      -L = 1, -r = 999, -d = 0.3, -c = 0.7, -f = 0.75, -p = 0.3, -m = 0.08, -u = 0.01, -e = 0.9, size = 3 (author suggestions)
12  2021  PC2P           minimum cluster size = 3
13  2022  ELF-DPC        ratio = 5, minimum cluster size = 3 (default setting)
In this section, we tested ELF-DPC on four original PPI networks (Gavin, Krogan core, DIP, and MIPS), and two sets of known protein complexes were used for training and for assessing the performance of ELF-DPC. We used six computational metrics (F-measure, CR, ACC, MMR, Jaccard, and Total score) to evaluate performance, where the Total score is defined as the sum of the first five metrics. The number of identified protein complexes (Num) was also counted for each method. To illustrate the performance of ELF-DPC, we selected ten representative unsupervised methods, MCL, DPClus (Altaf-Ul-Amin et al., 2006), CMC (Liu et al., 2009), ClusterONE (Nepusz et al., 2012), PEWCC (Zaki et al., 2013), WPNCA (Peng et al., 2014), CPredictor2.0 (Xu et al., 2017), Zhang (Zhang et al., 2016), ICJointLE (Zhang et al., 2019), and PC2P (Omranian et al., 2021), and two state-of-the-art supervised methods, ClusterEPs (Liu et al., 2016) and ClusterSS (Dong et al., 2018). Tables 5, 6 show the comparison results of all methods on the four PPI networks in terms of the six evaluation metrics; the highest value of each metric on each PPI network is in bold.
TABLE 5

Experimental results by the different methods using standard protein complexes 1.

Name            Num    F-measure  CR      ACC     MMR     Jaccard  Total score
Gavin
 MCL            220    0.5358     0.4891  0.3657  0.1494  0.3610   1.9010
 DPClus         285    0.5972     0.4382  0.3466  0.1736  0.4025   1.9581
 CMC            294    0.5844     0.4501  0.3487  0.2229  0.4179   2.0239
 ClusterONE     258    0.5976     0.4514  0.3458  0.1921  0.3974   1.9844
 PEWCC          664    0.6576     0.4316  0.3146  0.3538  0.3969   2.1546
 WPNCA          484    0.6428     0.4949  0.3114  0.2557  0.3554   2.0602
 CPredictor2.0  266    0.6286     0.3750  0.3062  0.2144  0.4124   1.9365
 Zhang          438    0.6475     0.3976  0.3156  0.3182  0.4084   2.0872
 ClusterEPs     271    0.6014     0.3656  0.2841  0.2166  0.4090   1.8766
 ClusterSS      482    0.5600     0.3941  0.3218  0.2535  0.3685   1.8979
 ICJointLE      243    0.6329     0.3557  0.2989  0.2619  0.4021   1.9515
 PC2P           219    0.5769     0.4439  0.3551  0.1825  0.3922   1.9505
 ELF-DPC        286    0.6674     0.4792  0.3391  0.2516  0.4330   2.1702
Krogan core
 MCL            370    0.4004     0.3895  0.3192  0.1361  0.2902   1.5354
 DPClus         497    0.4138     0.3672  0.3071  0.1745  0.3235   1.5861
 CMC            264    0.4819     0.3656  0.2978  0.1584  0.3688   1.6724
 ClusterONE     240    0.4694     0.3085  0.2829  0.1523  0.3324   1.5454
 PEWCC          383    0.5289     0.3231  0.2309  0.1471  0.3786   1.6085
 WPNCA          369    0.5446     0.3897  0.2758  0.1912  0.3415   1.7428
 CPredictor2.0  236    0.5895     0.3037  0.2725  0.1954  0.3688   1.7298
 Zhang          326    0.5563     0.2884  0.2549  0.2182  0.3408   1.6585
 ClusterEPs     410    0.5836     0.3352  0.2621  0.2209  0.3448   1.7467
 ClusterSS      722    0.4377     0.3758  0.3072  0.2402  0.3357   1.6966
 ICJointLE      216    0.5389     0.2206  0.2284  0.1936  0.3042   1.4857
 PC2P           249    0.4356     0.3458  0.2970  0.1337  0.3190   1.5310
 ELF-DPC        304    0.6287     0.4239  0.2984  0.2687  0.4302   2.0499
DIP
 MCL            628    0.3106     0.3578  0.2684  0.0932  0.2155   1.2455
 DPClus         909    0.3085     0.3792  0.2720  0.1237  0.2645   1.3480
 CMC            1,192  0.3611     0.3552  0.2488  0.1973  0.2960   1.4584
 ClusterONE     904    0.5118     0.5062  0.3270  0.1752  0.3297   1.8499
 PEWCC          648    0.6004     0.3783  0.2262  0.1573  0.3514   1.7136
 WPNCA          623    0.5888     0.4307  0.2594  0.2070  0.3360   1.8219
 CPredictor2.0  293    0.5008     0.2302  0.2287  0.1110  0.2825   1.3533
 Zhang          502    0.5622     0.3257  0.2426  0.1811  0.3223   1.6339
 ClusterEPs     804    0.5730     0.2954  0.2147  0.2154  0.3087   1.6073
 ClusterSS      2,375  0.3230     0.3335  0.2577  0.2331  0.2573   1.4047
 ICJointLE      286    0.5733     0.2329  0.2046  0.1507  0.3039   1.4655
 PC2P           441    0.3419     0.3401  0.2542  0.0854  0.2324   1.2540
 ELF-DPC        564    0.6200     0.4922  0.2768  0.2273  0.3454   1.9617
MIPS
 MCL            594    0.0681     0.1686  0.1577  0.0214  0.1064   0.5221
 DPClus         207    0.3784     0.2031  0.2133  0.0820  0.2264   1.1031
 CMC            408    0.3344     0.2334  0.2126  0.0997  0.2258   1.1059
 ClusterONE     690    0.2925     0.2719  0.2489  0.0989  0.2044   1.1167
 PEWCC          382    0.2802     0.1900  0.1389  0.0566  0.1679   0.8335
 WPNCA          527    0.3301     0.2603  0.1824  0.1017  0.1798   1.0543
 CPredictor2.0  265    0.4344     0.2212  0.2288  0.1140  0.2545   1.2529
 Zhang          406    0.3702     0.2051  0.2025  0.1077  0.2176   1.1031
 ClusterEPs     645    0.4610     0.2426  0.1943  0.1580  0.2543   1.3102
 ClusterSS      1,266  0.2309     0.2400  0.2320  0.1242  0.1942   1.0213
 ICJointLE      121    0.3649     0.1343  0.1723  0.0845  0.2066   0.9626
 PC2P           374    0.2347     0.2371  0.2137  0.0652  0.1662   0.9170
 ELF-DPC        483    0.4811     0.2914  0.2237  0.1678  0.2599   1.4239

The bold values are the highest value of each metric of each PPI network.

TABLE 6

Experimental results by the different methods using standard protein complexes 2.

Name            Num    F-measure  CR      ACC     MMR     Jaccard  Total score
Gavin
 MCL            220    0.3756     0.4091  0.3587  0.1153  0.3126   1.5713
 DPClus         285    0.3854     0.3483  0.3293  0.1405  0.3147   1.5182
 CMC            294    0.3803     0.3575  0.3301  0.1459  0.3257   1.5395
 ClusterONE     258    0.4090     0.3633  0.3359  0.1419  0.3200   1.5703
 PEWCC          664    0.4185     0.3483  0.3137  0.2152  0.2999   1.5955
 WPNCA          484    0.4217     0.4116  0.3305  0.1670  0.2962   1.6270
 CPredictor2.0  266    0.4820     0.3076  0.2816  0.1564  0.3309   1.5584
 Zhang          438    0.4365     0.3209  0.2942  0.2057  0.3186   1.5758
 ClusterEPs     271    0.4331     0.2906  0.2715  0.1670  0.3173   1.4795
 ClusterSS      487    0.3729     0.3279  0.3170  0.1716  0.2924   1.4819
 ICJointLE      243    0.4861     0.2920  0.2834  0.1912  0.3257   1.5785
 PC2P           219    0.4025     0.3610  0.3413  0.1295  0.3204   1.5547
 ELF-DPC        265    0.4546     0.3838  0.3259  0.1745  0.3619   1.7006
Krogan core
 MCL            370    0.3214     0.3534  0.3088  0.0944  0.2559   1.3339
 DPClus         497    0.3577     0.3335  0.2899  0.1200  0.2893   1.3904
 CMC            264    0.3999     0.3192  0.2732  0.1101  0.3149   1.4173
 ClusterONE     240    0.3913     0.2729  0.2756  0.1058  0.2826   1.3282
 PEWCC          383    0.4228     0.2913  0.2125  0.0987  0.3247   1.3500
 WPNCA          369    0.4361     0.3572  0.2614  0.1250  0.2960   1.4757
 CPredictor2.0  236    0.4932     0.2787  0.2421  0.1258  0.3216   1.4614
 Zhang          326    0.4637     0.2634  0.2373  0.1456  0.2957   1.4057
 ClusterEPs     410    0.4658     0.3021  0.2390  0.1444  0.2975   1.4488
 ClusterSS      342    0.4304     0.3201  0.2705  0.1318  0.3140   1.4669
 ICJointLE      216    0.4516     0.2083  0.2147  0.1230  0.2726   1.2702
 PC2P           249    0.3636     0.3141  0.2884  0.0951  0.2818   1.3429
 ELF-DPC        281    0.5336     0.3768  0.2827  0.1750  0.3785   1.7467
DIP
 MCL            628    0.2409     0.3025  0.2504  0.0613  0.1921   1.0473
 DPClus         909    0.2784     0.3424  0.2493  0.0898  0.2445   1.2044
 CMC            1,192  0.3130     0.3213  0.2193  0.1329  0.2664   1.2530
 ClusterONE     904    0.4232     0.4358  0.2937  0.1184  0.2874   1.5585
 PEWCC          648    0.4812     0.3336  0.2182  0.0950  0.2986   1.4266
 WPNCA          623    0.4603     0.3709  0.2472  0.1226  0.2866   1.4876
 CPredictor2.0  293    0.4653     0.2265  0.2077  0.0736  0.2635   1.2367
 Zhang          502    0.4929     0.2928  0.2215  0.1223  0.2818   1.4113
 ClusterEPs     804    0.4611     0.2646  0.1929  0.1323  0.2652   1.3162
 ClusterSS      2,179  0.3676     0.3168  0.2360  0.1588  0.2340   1.3132
 ICJointLE      286    0.4734     0.2168  0.2027  0.0961  0.2668   1.2558
 PC2P           441    0.2662     0.2967  0.2337  0.0588  0.2083   1.0636
 ELF-DPC        545    0.5126     0.3998  0.2607  0.1386  0.3020   1.6137
MIPS
 MCL            594    0.0551     0.1640  0.1475  0.0125  0.1031   0.4822
 DPClus         207    0.3307     0.1934  0.1948  0.0547  0.2049   0.9785
 CMC            408    0.2981     0.2125  0.1873  0.0642  0.1999   0.9620
 ClusterONE     690    0.2473     0.2384  0.2148  0.0630  0.1801   0.9435
 PEWCC          382    0.2309     0.1700  0.1166  0.0296  0.1301   0.6773
 WPNCA          527    0.2640     0.2383  0.1549  0.0621  0.1522   0.8716
 CPredictor2.0  265    0.3843     0.2086  0.1966  0.0672  0.2264   1.0831
 Zhang          406    0.3413     0.1944  0.1857  0.0710  0.2002   0.9925
 ClusterEPs     645    0.3582     0.2115  0.1720  0.0884  0.2120   1.0421
 ClusterSS      1,581  0.2539     0.2566  0.2074  0.0894  0.1867   0.9940
 ICJointLE      121    0.2959     0.1224  0.1593  0.0538  0.1787   0.8101
 PC2P           374    0.2078     0.2136  0.1941  0.0432  0.1524   0.8112
 ELF-DPC        469    0.4026     0.2599  0.1937  0.1011  0.2249   1.1822

The bold values are the highest value of each metric of each PPI network.

As shown in Table 5, when standard protein complexes 2 were used as the training set and standard protein complexes 1 as the test set, ELF-DPC achieved the highest F-measure, Jaccard, and Total score on most of the four PPI networks. On the Gavin dataset in Table 5, ELF-DPC ranks third in CR, sixth in ACC, and sixth in MMR. On the Krogan core dataset, ELF-DPC ranks first in CR, fourth in ACC, and first in MMR, at 0.2687. On the DIP dataset, ELF-DPC ranks second in CR and ACC, second in MMR, and second in Jaccard at 0.3454, slightly below the best. On the MIPS dataset, ELF-DPC ranks first in CR at 0.2914, fourth in ACC, and first in MMR. We then used standard protein complexes 1 as the positive training set and standard protein complexes 2 as the test set; the results are presented in Table 6. One can quickly see that ELF-DPC has the best F-measure, MMR, Jaccard, and Total score on most tested datasets. Although ELF-DPC did not obtain the highest CR and ACC, the comparison results are similar to those obtained with standard protein complexes 1 as the test set in Table 5.
According to the experimental results in Tables 5, 6, in some cases algorithms that identify more protein complexes achieve the highest MMR, such as PEWCC and ClusterSS, suggesting that MMR favors detection algorithms that output more complexes. Meanwhile, although the number of protein complexes identified by ELF-DPC is relatively small, it still achieves the highest MMR on some datasets, indicating that the protein complexes identified by ELF-DPC attain a better maximal one-to-one mapping to the standard protein complexes. Overall, the comparative results show that ELF-DPC achieves a higher Total score than all compared methods on all datasets, meaning that ELF-DPC performs better than these competitive methods on most computational evaluation metrics across the tested datasets.

3.4 Comparison With Functional Enrichment Analysis

We further substantiated the biological significance of the protein complexes detected by the different methods by comparing the p-values of the identified complexes against the GO (Gene Ontology) database, which covers three domains: biological process, molecular function, and cellular component. Since the p-value of an identified protein complex is closely related to its size (Wang et al., 2019), these statistics must be analyzed together. Therefore, the number of significantly identified protein complexes and their percentages at p-value thresholds from 1E-2 to 1E-20 were used to estimate functional enrichment. We analyzed the protein complexes discovered by ELF-DPC and the compared algorithms using the p-value test; in general, a protein complex with a lower p-value is more significant. The functional enrichment results for these methods are shown in Tables 7 and 8, where Num is the total number of identified protein complexes and AS is the mean size of the identified protein complexes.
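The p-value test referred to here is conventionally computed from the hypergeometric distribution: given N annotated proteins in total, m of which carry a GO term, it gives the probability that a detected complex of size n shares at least k members with that term by chance. A minimal sketch follows; the background set and any multiple-testing correction used by the authors' tooling may differ.

```python
from math import comb

def enrichment_p_value(N, m, n, k):
    """Hypergeometric tail P(X >= k): chance that a random set of n
    proteins, drawn from N total of which m are annotated with the GO
    term, contains at least k annotated members."""
    return sum(comb(m, i) * comb(N - m, n - i)
               for i in range(k, min(m, n) + 1)) / comb(N, n)
```

A complex whose members are all drawn from one small annotation set yields a vanishingly small p-value, which is why the tables report counts below thresholds such as E-20 and E-15.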
TABLE 7

Results of function enrichment test with different thresholds of p-value on Gavin and Krogan core.

Algorithms | Num | AS | < E-20 | < E-15 | < E-10 | < E-5 | Significant
Gavin
 MCL | 220 | 7.56 | 39 (17.73%) | 48 (21.82%) | 83 (37.73%) | 183 (83.18%) | 194 (88.18%)
 DPClus | 285 | 6.09 | 30 (10.53%) | 49 (17.2%) | 88 (30.88%) | 182 (63.86%) | 208 (72.98%)
 CMC | 294 | 5.83 | 43 (14.63%) | 57 (19.39%) | 82 (27.89%) | 171 (58.16%) | 206 (70.06%)
 ClusterONE | 258 | 7.24 | 39 (15.12%) | 53 (20.55%) | 101 (39.15%) | 187 (72.48%) | 205 (79.46%)
 PEWCC | 664 | 8.14 | 61 (9.19%) | 117 (17.62%) | 238 (35.84%) | 480 (72.29%) | 546 (82.23%)
 CPredictor2.0 | 266 | 6.04 | 29 (10.9%) | 51 (19.17%) | 122 (45.86%) | 231 (86.84%) | 244 (91.73%)
 WPNCA | 484 | 16.62 | 125 (25.83%) | 180 (37.19%) | 281 (58.06%) | 423 (87.4%) | 449 (92.77%)
 Zhang | 438 | 6.30 | 44 (10.05%) | 83 (18.95%) | 164 (37.44%) | 318 (72.6%) | 354 (80.82%)
 ClusterEPs | 271 | 6.25 | 53 (19.56%) | 86 (31.74%) | 143 (52.77%) | 240 (88.56%) | 256 (94.46%)
 ClusterSS | 482 | 5.62 | 63 (13.07%) | 95 (19.71%) | 167 (34.65%) | 336 (69.71%) | 368 (76.35%)
  | 487 | 5.36 | 50 (10.27%) | 83 (17.05%) | 147 (30.19%) | 324 (66.53%) | 368 (75.56%)
 ICJointLE | 243 | 5.73 | 25 (10.29%) | 27 (11.11%) | 83 (34.16%) | 196 (80.66%) | 207 (85.19%)
 PC2P | 219 | 6.91 | 17 (7.76%) | 11 (5.02%) | 40 (18.26%) | 106 (48.4%) | 119 (54.34%)
 ELF-DPC | 286 | 8.81 | 59 (20.63%) | 104 (36.36%) | 154 (53.84%) | 244 (85.31%) | 262 (91.6%)
  | 265 | 8.66 | 65 (24.53%) | 89 (33.59%) | 140 (52.84%) | 231 (87.18%) | 244 (92.09%)
Krogan core
 MCL | 370 | 5.91 | 82 (22.16%) | 119 (32.16%) | 173 (46.75%) | 275 (74.32%) | 293 (79.18%)
 DPClus | 497 | 4.23 | 20 (4.02%) | 43 (8.65%) | 75 (15.09%) | 253 (50.9%) | 303 (60.96%)
 CMC | 264 | 5.05 | 20 (7.58%) | 29 (10.99%) | 44 (16.67%) | 60 (22.73%) | 63 (23.87%)
 ClusterONE | 240 | 5.27 | 44 (18.33%) | 75 (31.25%) | 121 (50.42%) | 202 (84.17%) | 216 (90.0%)
 PEWCC | 383 | 10.16 | 152 (39.69%) | 205 (53.53%) | 277 (72.33%) | 358 (93.48%) | 377 (98.44%)
 CPredictor2.0 | 236 | 5.19 | 24 (10.17%) | 46 (19.49%) | 93 (39.41%) | 213 (90.26%) | 219 (92.8%)
 WPNCA | 369 | 12.59 | 43 (11.65%) | 81 (21.95%) | 172 (46.61%) | 321 (86.99%) | 339 (91.87%)
 Zhang | 326 | 5.41 | 37 (11.35%) | 65 (19.94%) | 118 (36.2%) | 259 (79.45%) | 279 (85.58%)
 ClusterEPs | 410 | 6.18 | 59 (14.39%) | 95 (23.17%) | 168 (40.97%) | 341 (83.17%) | 365 (89.02%)
 ClusterSS | 722 | 4.86 | 47 (6.51%) | 95 (13.16%) | 160 (22.16%) | 371 (51.38%) | 454 (62.88%)
  | 342 | 7.01 | 48 (14.04%) | 88 (25.74%) | 155 (45.33%) | 280 (81.88%) | 304 (88.9%)
 ICJointLE | 216 | 4.41 | 16 (7.41%) | 21 (9.72%) | 68 (31.48%) | 184 (85.18%) | 192 (88.88%)
 PC2P | 249 | 5.81 | 16 (6.43%) | 23 (9.24%) | 46 (18.48%) | 136 (54.62%) | 159 (63.86%)
 ELF-DPC | 304 | 9.55 | 80 (26.32%) | 115 (37.83%) | 163 (53.62%) | 277 (91.12%) | 292 (96.05%)
  | 281 | 9.13 | 81 (28.83%) | 111 (39.51%) | 155 (55.17%) | 262 (93.25%) | 269 (95.74%)

The bold values are the highest value of each metric of each PPI network.

TABLE 8

Results of function enrichment test with different thresholds of p-value on DIP and MIPS.

Algorithms | Num | AS | < E-20 | < E-15 | < E-10 | < E-5 | Significant
DIP
 MCL | 628 | 6.31 | 74 (11.78%) | 125 (19.9%) | 209 (33.28%) | 414 (65.92%) | 471 (75.0%)
 DPClus | 909 | 4.28 | 45 (4.95%) | 64 (7.04%) | 112 (12.32%) | 364 (40.04%) | 470 (51.7%)
 CMC | 1,192 | 3.81 | 90 (7.55%) | 150 (12.58%) | 304 (25.5%) | 692 (58.05%) | 829 (69.54%)
 ClusterONE | 904 | 6.40 | 54 (5.97%) | 110 (12.16%) | 259 (28.64%) | 606 (67.02%) | 705 (77.97%)
 PEWCC | 648 | 10.10 | 156 (24.07%) | 249 (38.42%) | 379 (58.48%) | 584 (90.12%) | 605 (93.36%)
 CPredictor2.0 | 293 | 4.54 | 18 (6.14%) | 49 (16.72%) | 124 (42.32%) | 274 (93.51%) | 285 (97.26%)
 WPNCA | 623 | 12.41 | 81 (13.0%) | 137 (21.99%) | 228 (36.6%) | 431 (69.18%) | 481 (77.21%)
 Zhang | 502 | 5.18 | 44 (8.76%) | 99 (19.72%) | 200 (39.84%) | 424 (84.46%) | 448 (89.24%)
 ClusterEPs | 804 | 4.26 | 91 (11.32%) | 145 (18.04%) | 268 (33.34%) | 625 (77.74%) | 683 (84.95%)
 ClusterSS | 2,375 | 3.57 | 156 (6.57%) | 253 (10.65%) | 437 (18.4%) | 1,047 (44.08%) | 1,289 (54.27%)
  | 2,179 | 5.74 | 110 (5.05%) | 230 (10.56%) | 501 (23.0%) | 1,332 (61.14%) | 1,574 (72.25%)
 ICJointLE | 286 | 3.84 | 29 (10.14%) | 27 (9.44%) | 103 (36.01%) | 248 (86.71%) | 253 (88.46%)
 PC2P | 441 | 6.25 | 25 (5.67%) | 14 (3.17%) | 45 (10.2%) | 185 (41.95%) | 230 (52.15%)
 ELF-DPC | 564 | 14.43 | 140 (24.82%) | 186 (32.98%) | 289 (51.24%) | 512 (90.78%) | 542 (96.1%)
  | 545 | 12.77 | 142 (26.06%) | 203 (37.25%) | 307 (56.33%) | 493 (90.46%) | 517 (94.86%)
MIPS
 MCL | 594 | 6.16 | 17 (2.86%) | 29 (4.88%) | 80 (13.47%) | 165 (27.78%) | 230 (38.72%)
 DPClus | 207 | 4.94 | 17 (8.21%) | 27 (13.04%) | 85 (41.06%) | 169 (81.64%) | 184 (88.89%)
 CMC | 408 | 4.87 | 30 (7.35%) | 49 (12.01%) | 101 (24.76%) | 234 (57.36%) | 278 (68.14%)
 ClusterONE | 690 | 6.03 | 22 (3.19%) | 47 (6.81%) | 137 (19.85%) | 327 (47.39%) | 483 (70.0%)
 PEWCC | 382 | 24.70 | 67 (17.54%) | 94 (24.61%) | 172 (45.03%) | 308 (80.63%) | 325 (85.08%)
 CPredictor2.0 | 265 | 4.60 | 19 (7.17%) | 40 (15.09%) | 118 (44.52%) | 249 (93.95%) | 258 (97.35%)
 WPNCA | 527 | 18.27 | 60 (11.39%) | 103 (19.55%) | 234 (44.41%) | 436 (82.74%) | 471 (89.38%)
 Zhang | 406 | 5.14 | 16 (3.94%) | 37 (9.11%) | 111 (27.34%) | 319 (78.57%) | 355 (87.44%)
 ClusterEPs | 645 | 4.78 | 22 (3.41%) | 45 (6.98%) | 150 (23.26%) | 443 (68.69%) | 500 (77.53%)
 ClusterSS | 1,266 | 4.22 | 33 (2.61%) | 70 (5.53%) | 176 (13.9%) | 607 (47.94%) | 752 (59.39%)
  | 1,581 | 5.81 | 25 (1.58%) | 67 (4.24%) | 237 (14.99%) | 845 (53.45%) | 1,069 (67.62%)
 ICJointLE | 121 | 3.70 | 14 (11.57%) | 16 (13.22%) | 42 (34.71%) | 102 (84.3%) | 103 (85.13%)
 PC2P | 374 | 6.29 | 7 (1.87%) | 4 (1.07%) | 41 (10.96%) | 171 (45.72%) | 202 (54.01%)
 ELF-DPC | 483 | 9.33 | 109 (22.57%) | 166 (34.37%) | 246 (50.93%) | 441 (91.3%) | 463 (95.85%)
  | 469 | 8.86 | 105 (22.39%) | 155 (33.05%) | 253 (53.95%) | 437 (93.18%) | 458 (97.66%)

The bold values are the highest value of each metric of each PPI network.

As Table 7 shows, on the Gavin dataset, ClusterEPs obtains a higher proportion of significantly identified protein complexes (94.46%) than our ELF-DPC. However, ELF-DPC achieves a higher proportion of significant complexes at the stricter threshold of p-value < E-15. On the Krogan core dataset, PEWCC attains a higher proportion of significantly identified complexes than our ELF-DPC. One reason is that the mean size (AS) of the complexes identified by PEWCC is 10.16, whereas the AS values of ELF-DPC are 9.55 and 9.13, respectively. In general, the p-value of an identified complex is closely associated with its size, and it decreases as the size of the detected complex increases (Wu et al., 2009; Peng et al., 2014). As Table 8 shows, on the DIP dataset, CPredictor2.0 obtains a higher proportion of significantly identified complexes than our ELF-DPC, while ELF-DPC achieves a higher proportion at p-value < E-20. On the MIPS dataset, ELF-DPC outperforms the other competing methods in the proportion of significantly identified complexes. Therefore, we conclude that ELF-DPC detects more protein complexes with biological significance. Although some detected complexes do not currently match known protein complexes, they are strong candidates for verification as actual protein complexes by laboratory techniques. Based on the above results, the protein complexes identified by ELF-DPC have significant biological meaning.

3.5 Case Study

To illustrate the clustering results, we visualized the 208th standard protein complex of standard protein complexes 1 in Figure 5. We use a label format that encodes each result; for example, (b) ELF-DPC-1.0-10 means that the neighborhood affinity (Eq. 15) of the complex detected by ELF-DPC is 1.0 and that it contains 10 proteins. Red nodes are proteins correctly identified by a method, yellow nodes are proteins it missed, and blue nodes are proteins it identified incorrectly. Figure 5 (a) shows that the 208th standard protein complex contains 10 proteins. The clustering results of the other thirteen methods, (b) ELF-DPC, (c) ClusterONE and ClusterSS, (d) CPredictor2.0, (e) PEWCC, (f) MCL, (g) ClusterEPs, (h) ICJointLE, (i) CMC, DPClus, and PC2P, (j) WPNCA, and (k) Zhang, are all from the Krogan core dataset. (c) ClusterONE and ClusterSS, (d) CPredictor2.0, (e) PEWCC, (g) ClusterEPs, (h) ICJointLE, (i) CMC, DPClus, and PC2P, and (k) Zhang identified only part of the 208th standard protein complex, missing some of its proteins. Meanwhile, (j) WPNCA and (f) MCL both missed some proteins and incorrectly included others. In contrast, our ELF-DPC method accurately identified all 10 proteins and achieved the best performance on this complex.
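The red/yellow/blue coloring described above is simply a partition of two node sets. A minimal sketch (the function name is ours, not from the paper's code):

```python
def classify_nodes(detected, standard):
    """Partition proteins for the Figure 5 style coloring:
    red    = correctly identified (in both the detected and standard complex),
    yellow = missed (in the standard complex only),
    blue   = wrongly included (in the detected complex only)."""
    detected, standard = set(detected), set(standard)
    return {"red": detected & standard,
            "yellow": standard - detected,
            "blue": detected - standard}
```

For a perfect match such as ELF-DPC on the 208th complex, the `yellow` and `blue` sets are both empty and the neighborhood affinity equals 1.0.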
FIGURE 5

An example protein complex identified by different methods on the Krogan core PPI network. For example, (b) ELF-DPC-1.0–10, which means that the neighborhood affinity (Eq. 15) of ELF-DPC is 1.0, and it contains 10 proteins. Here, the red nodes are proteins that are correctly identified by this method, the yellow nodes are proteins that are missed by this method, and the blue nodes are the proteins that are incorrectly identified by this method.

Moreover, Table 9 lists 16 protein complexes with vital biological significance identified by the ELF-DPC algorithm in the four PPI networks, providing helpful biological knowledge to related researchers.
TABLE 9

The identified protein complexes with small p-values.

Num | p-value | GO ID | Gene Ontology term
Gavin
 1 | 9.72641E-59 | GO:0000502 | proteasome complex
 2 | 4.53112E-61 | GO:0005762 | mitochondrial large ribosomal subunit
 3 | 9.18655E-68 | GO:0030686 | 90S preribosome
 4 | 2.61255E-65 | GO:0030532 | small nuclear ribonucleoprotein complex
Krogan core
 1 | 2.50943E-71 | GO:0000375 | RNA splicing, via transesterification reactions
 2 | 1.21735E-66 | GO:0005681 | spliceosomal complex
 3 | 7.46423E-67 | GO:0000377 | RNA splicing, via transesterification reactions with bulged adenosine as nucleophile
 4 | 5.5331E-62 | GO:0003899 | DNA-directed 5′-3′ RNA polymerase activity
DIP
 1 | 2.14679E-64 | GO:0042254 | ribosome biogenesis
 2 | 5.5228E-53 | GO:0042274 | ribosomal small subunit biogenesis
 3 | 5.18295E-62 | GO:0016592 | mediator complex
 4 | 6.85479E-66 | GO:0097525 | spliceosomal snRNP complex
MIPS
 1 | 1.22375E-47 | GO:0050657 | nucleic acid transport
 2 | 1.27336E-44 | GO:0030687 | preribosome, large subunit precursor
 3 | 1.58322E-42 | GO:0022624 | proteasome accessory complex
 4 | 9.71714E-32 | GO:0000124 | SAGA complex

4 Conclusion

Although many protein complex detection methods have been proposed in recent decades, detection with consistently excellent performance remains a bottleneck in bioinformatics. This study presented an ensemble learning framework that identifies protein complexes according to their core-attachment structure. First, a weighted PPI network is constructed by integrating gene expression data, gene ontology data, and subcellular localization data with the topological structure. Next, the protein complex core mining strategy finds protein complex cores. We then introduced a new model training method to construct a training dataset and extracted various topological features to train a VotingRegressor model that describes protein complexes through supervised learning. Furthermore, we defined structural modularity to model the internal organization of protein complexes; combining it with the trained regressor yields an ensemble learning model that guides the search for protein complexes. Finally, we designed a graph heuristic search strategy that extends the protein complex cores into complete complexes in the PPI networks. The experimental results show that ELF-DPC performs better than the competing methods and can mine protein complexes with high biological significance. Because ELF-DPC cannot detect small protein complexes (size ≤ 2), we will consider integrating other data sources (Tan et al., 2018) to identify them. In the future, we plan to infer drug-disease associations by constructing a heterogeneous network of drugs, detected protein complexes, and diseases to unveil disease mechanisms and discover available drugs (Yu et al., 2015). We also plan to explore graph attention networks and other deep learning methods for identifying protein complexes.
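The ensemble idea summarized above, scoring a candidate cluster by combining a structural modularity term with a trained voting regressor, can be sketched as follows. This is an illustrative stand-in: the weighting, the toy modularity definition, and the base models here are our assumptions, not the paper's exact formulation (see the authors' repository for that).

```python
class VotingRegressor:
    """Minimal voting regressor: averages the predictions of its base
    models, where each base model is any callable mapping a feature
    vector to a score."""
    def __init__(self, models):
        self.models = models

    def predict(self, features):
        return sum(m(features) for m in self.models) / len(self.models)

def structural_modularity(internal_edges, boundary_edges, n_nodes):
    """Toy structural modularity: internal edge density scaled by how
    well the cluster is separated from the rest of the network."""
    if n_nodes < 2 or internal_edges == 0:
        return 0.0
    density = 2.0 * internal_edges / (n_nodes * (n_nodes - 1))
    separation = internal_edges / (internal_edges + boundary_edges)
    return density * separation

def ensemble_score(features, internal_edges, boundary_edges, n_nodes,
                   regressor, alpha=0.5):
    """Blend the unsupervised (modularity) and supervised (regressor)
    views of a candidate complex; alpha is an assumed mixing weight."""
    sm = structural_modularity(internal_edges, boundary_edges, n_nodes)
    return alpha * sm + (1 - alpha) * regressor.predict(features)
```

A graph heuristic search would then grow each protein complex core by greedily adding the neighbor that most increases this score, stopping when no addition improves it.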
Algorithm 1

The framework of ELF-DPC algorithm.


References

1. Zhang X-F, Dai D-Q, Li X-X. Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinform. 2012.

2. Liu G, Wong L, Chua HN. Complex discovery from weighted PPI networks. Bioinformatics. 2009.

3. Hu L, Yuan X, Liu X, Xiong S, Luo X. Efficiently detecting protein complexes from protein interaction networks via alternating direction method of multipliers. IEEE/ACM Trans Comput Biol Bioinform. 2018.

4. Grover A, Leskovec J. node2vec: scalable feature learning for networks. KDD. 2016.

5. Krogan NJ, Cagney G, Yu H, et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006.

6. Shi L, Lei X, Zhang A. Protein complex detection with semi-supervised learning in protein interaction networks. Proteome Sci. 2011.

7. Yu F, Yang Z, Tang N, Lin H, Wang J. Predicting protein complex in protein interaction network - a supervised learning based method. BMC Syst Biol. 2014.

8. Wu M, Li X, Kwoh C-K, Ng S-K. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009.

9. Zaki N, Efimov D, Berengueres J. Protein complex detection using interaction reliability assessment and weighted clustering coefficient. BMC Bioinformatics. 2013.

10. Hong EL, Balakrishnan R, Dong Q, et al. Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res. 2007.
